The standard public switched telephone network (PSTN), which has been part of our daily life for more than a century, is designed to transmit toll-quality voice only. This design target has been inherited in most modern and fully digitized phone systems, such as digital private branch exchange (PBX) and voice over IP (VoIP) phones. As a result, these systems, i.e., the PSTN (whether implemented digitally or in analog circuitry), digital PBX, and VoIP, are only able to deliver analog signals in a relatively narrow frequency band, about 200-3500 Hz, as illustrated in FIG. 1. This bandwidth will be referred to herein as “narrow band” (NB).
The NB bandwidth is so small that the intelligibility of speech frequently suffers, not to mention the poor subjective quality of the audio. Moreover, with the entire bandwidth occupied by voice, there is little room left for additional payload that can support other services and features. In order to improve the voice quality and intelligibility and/or to incorporate additional services and features, a larger frequency bandwidth is needed.
Over the past several decades, the PSTN has evolved from analog to digital, with many performance indices, such as switching and control, greatly improved. In addition, there are emerging fully digitized systems like digital PBX and VoIP. However, the bandwidth design target for the equipment of these systems, i.e., narrow band (NB) for transmitting toll-quality voice only, has not changed at all. Thus, the existing infrastructure, either PSTN, digital PBX, or VoIP, cannot be relied upon to provide a wider frequency band. Alternate solutions have to be investigated.
Many efforts have been made to extend the capacity of an NB channel given the limited physical bandwidth. Existing approaches, which will be described below, can be classified into the following categories: time or frequency division multiplexing; voice or audio encoding; simultaneous voice and data; and audio watermarking.
Time or frequency division multiplexing techniques are simple in that they place voice and the additional payload in regions that are different in time or frequency. For example, in the well-known calling line ID (CLID) display feature, which is now widely used in telephone services, information about the caller's identity is sent to the called party's terminal between the first and the second rings, a period in which there is no other signal on the line. This information is then decoded and the caller's identity is displayed on the called terminal. Another example is the call waiting feature in telephony, which provides an audible beep to a person who is talking on the line as an indication that a third party is trying to reach him/her. This beep replaces the voice the first party might be hearing, and thus can cause a voice interruption. These two examples are time-division multiplexing approaches. A typical terminal product that incorporates these features is Vista 390™, by Aastra Technologies Limited.
As a frequency-division multiplexing example, frequency components of voice can be limited to below 2 kHz and the band beyond that frequency can be used to transmit the additional payload. This frequency limiting operation further degrades the already-low voice quality and intelligibility associated with an NB channel. Another frequency-division multiplexing example makes use of both lower and upper frequency bands that are just beyond voice but still within the PSTN's capacity, although these bands may sometimes be narrow or even non-existent. With some built-in intelligence, the system first tests the channel condition and then uses the result, together with a pre-stored user-selectable preference, to determine a trade-off between voice quality and the rate of additional payload. Time and frequency division multiplexing approaches are simple and are therefore widely used. However, they inevitably cause voice interruption or degradation, or both.
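The first frequency-division example above can be illustrated with a minimal sketch. It is not taken from any particular product: the 8 kHz sampling rate, the 500 Hz stand-in "voice" tone, the 2500 Hz and 3000 Hz payload tones, and the frame length are all assumptions chosen for illustration. The payload is placed above 2 kHz and recovered with a Goertzel detector while the voice occupies the band below 2 kHz.

```python
import math

FS = 8000                        # sampling rate in Hz (assumed)
N = 400                          # samples per payload bit (assumed frame length)
F_VOICE = 500.0                  # stand-in "voice" tone, below the 2 kHz limit
F_BIT = {0: 2500.0, 1: 3000.0}   # hypothetical payload tones above 2 kHz


def tone(freq, n, amp=1.0):
    """Return n samples of a sine wave at freq Hz."""
    return [amp * math.sin(2 * math.pi * freq * k / FS) for k in range(n)]


def goertzel_power(samples, freq):
    """Signal power at freq Hz, computed with the Goertzel recurrence."""
    coeff = 2 * math.cos(2 * math.pi * freq / FS)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def multiplex(bits):
    """Add one low-level payload tone per bit on top of the voice tone."""
    out = []
    for b in bits:
        voice = tone(F_VOICE, N, amp=1.0)
        data = tone(F_BIT[b], N, amp=0.1)
        out.extend(v + d for v, d in zip(voice, data))
    return out


def demultiplex(signal):
    """Decide each bit by comparing the power at the two payload tones."""
    bits = []
    for i in range(0, len(signal), N):
        frame = signal[i:i + N]
        p0 = goertzel_power(frame, F_BIT[0])
        p1 = goertzel_power(frame, F_BIT[1])
        bits.append(1 if p1 > p0 else 0)
    return bits


if __name__ == "__main__":
    sent = [1, 0, 1, 1, 0, 0, 1, 0]
    print(demultiplex(multiplex(sent)) == sent)
```

In this sketch the payload survives only because the voice has been kept out of the upper band, which is exactly the restriction that degrades voice quality in the approach described above.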
Voice coding and decoding (vocoding) schemes have been developed with the advancement of studies on speech production mechanisms and psycho-acoustics, as well as the rapid development of digital signal processing (DSP) theory and technology. A traditional depiction of the frequencies employed in narrowband telephony, such as that using the standard PSTN, digital PBX or VoIP, is shown in FIG. 1. Wide band (WB) telephony extends the frequency band of NB telephony to 50 Hz at the low end and 7000 Hz at the high end, providing much better intelligibility and voice quality. Since WB telephony cannot be implemented directly on an NB telephone network, compression schemes, such as ITU standards G.722, G.722.1, and G.722.2, have been developed to reduce the digital bit rate (number of digital bits needed per unit of time) to a level that is the same as, or lower than, that needed for transmitting NB voice. Other examples are the audio coding schemes MPEG-1 and MPEG-2, which are based on a human perceptual model. They effectively reduce the bit rate as do the G.722, G.722.1, and G.722.2 WB vocoders, but with better performance, more flexibility, and more complexity.
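The bit-rate argument above can be made concrete with a small calculation. The figures used here are assumptions for illustration rather than values stated in this document: NB voice is taken as 8 kHz sampling at 8 bits per sample (the 64 kbit/s rate of conventional digital telephony), and uncompressed WB voice is taken as 16 kHz sampling at 14 bits per sample.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate in bit/s of uncompressed, single-channel PCM audio."""
    return sample_rate_hz * bits_per_sample


def required_compression(raw_rate, target_rate):
    """Factor by which a coder must reduce raw_rate to fit target_rate."""
    return raw_rate / target_rate


# Assumed figures, see the paragraph above.
NB_RATE = pcm_bit_rate(8000, 8)          # NB voice channel
WB_RAW_RATE = pcm_bit_rate(16000, 14)    # uncompressed WB voice

if __name__ == "__main__":
    print(NB_RATE, WB_RAW_RATE, required_compression(WB_RAW_RATE, NB_RATE))
```

Under these assumptions a WB coder has to compress its input by a factor of 3.5 merely to match the NB bit rate, and by more to fall below it.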
All existing voice and audio coding, or compression, schemes operate in the digital domain, i.e., a coder at the transmitting end outputs digital bits, which a decoder at the receiving end takes as input. Therefore, in the PSTN case, a modulator/demodulator (modem) at each end of the connection is required in order to transmit and receive the digital bits over the analog channel. This modem is sometimes referred to as a “channel coding/decoding” device, because it converts between digital bits and the proper waveforms on the line. Thus, to implement a voice/audio coding scheme on a PSTN system, one will need both an implementation of the chosen voice/audio coding scheme, in either hardware or firmware, and a modem device. Such an implementation can be quite complicated. Furthermore, it is not compatible with the existing terminal equipment in the PSTN case. That is, a conventional NB phone, denoted as a “plain ordinary telephone set” (POTS), is not able to communicate with such an implementation on the PSTN line because it is equipped with neither a voice/audio coding scheme nor a modem.
Another category of PSTN capacity extension schemes is called “simultaneous voice and data” (SVD), and is often used in dial-up modems that connect computers to the Internet through the PSTN.
In an example, the additional payload, i.e., data in the context of SVD, is modulated onto a carrier to yield a signal with a very narrow band, around 2500 Hz. This is then mixed with the voice. The receiver uses a mechanism similar to an adaptive “decision feedback equalizer” (DFE) in data communications to recover the data and to subtract the carrier from the composite signal so that the listener is not annoyed. This technique depends on a properly converged DFE to arrive at a low bit error rate (BER). A user with a POTS, which does not have a DFE to remove the carrier, will certainly be annoyed by the modulated data, since it is right in the voice band.
In a typical example of SVD, each symbol (unit of data transmission) of data is phase-shift keyed (PSK) so that it takes one of several discrete points in a two-dimensional symbol constellation diagram. The analog voice signal, with a peak magnitude limited to less than half the distance separating the symbols, is then added so that the combined signal consists of clouds, as opposed to dots, in the symbol constellation diagram. At the receiver, each data symbol is determined based on which discrete point in the constellation diagram it is closest to. The symbol is then subtracted from the combined signal in an attempt to recover the voice. This method reduces the dynamic range, and hence the signal-to-noise ratio (SNR), of the voice. Again, a terminal without an SVD-capable modem, such as a POTS, cannot access the voice portion gracefully. To summarize, SVD approaches generally need SVD-capable modem hardware, which can be complicated and costly, and are not compatible with conventional end-user equipment, e.g., a POTS.
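The constellation-based scheme just described can be sketched as follows. This is a simplified illustration on an ideal, noise-free channel, not an implementation of any actual SVD modem: the choice of four PSK points, the representation of each voice sample as a complex number in the symbol plane, and the names `transmit`, `receive` and `random_voice` are all assumptions.

```python
import cmath
import math
import random

M = 4  # number of PSK constellation points (assumed: QPSK)
POINTS = [cmath.exp(1j * (2 * math.pi * k / M + math.pi / 4)) for k in range(M)]
MIN_DIST = abs(POINTS[0] - POINTS[1])   # spacing between adjacent points
VOICE_PEAK = 0.45 * MIN_DIST            # kept below half the symbol spacing


def transmit(symbols, voice):
    """Add each voice sample to the constellation point of its data symbol."""
    return [POINTS[s] + v for s, v in zip(symbols, voice)]


def receive(rx):
    """Pick the nearest constellation point, then subtract it to get the voice."""
    symbols, voice = [], []
    for r in rx:
        s = min(range(M), key=lambda k: abs(r - POINTS[k]))
        symbols.append(s)
        voice.append(r - POINTS[s])
    return symbols, voice


def random_voice(n, rng):
    """Stand-in voice: complex samples whose magnitude never exceeds VOICE_PEAK."""
    return [cmath.rect(rng.uniform(0, VOICE_PEAK), rng.uniform(0, 2 * math.pi))
            for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.randrange(M) for _ in range(1000)]
    voice = random_voice(len(data), rng)
    got_data, got_voice = receive(transmit(data, voice))
    print(got_data == data)
```

Because the voice magnitude stays below half the symbol spacing, the nearest constellation point is always the transmitted one in this idealized setting. The same bound is what limits the dynamic range available to the voice.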
Audio watermarking techniques embed certain information in an audio stream in such a way that it is inaudible to the human ear. The most common category of audio watermarking techniques uses the concept of spread spectrum communications. Spread spectrum technology can be employed to turn the additional payload into a low-level, noise-like time sequence. The characteristics of the human auditory system (HAS) can also be used: the temporal and frequency masking thresholds, calculated by using the methods specified in MPEG audio coding standards, are used to shape the embedded sequence. Audio watermarking techniques based on spread spectrum technology are in general vulnerable to channel degradations such as filtering, and the amount of payload has to be very low (on the order of 20 bits per second of audio) in order for them to be acceptably robust.
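A minimal sketch of the spread spectrum idea is given below. It only illustrates spreading each payload bit over a pseudo-noise (PN) sequence and recovering it by correlation. It omits the HAS-based shaping entirely, and the spreading factor, embedding gain, 440 Hz host tone and function names are assumptions chosen for illustration.

```python
import math
import random

CHIPS_PER_BIT = 2000   # spreading factor (assumed)
GAIN = 0.1             # embedding level relative to a unit-amplitude host (assumed)


def pn_sequence(n, seed=1):
    """Pseudo-noise sequence of +1/-1 chips shared by embedder and detector."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def embed(host, bits, pn):
    """Add each bit, spread over CHIPS_PER_BIT chips, to the host at low level."""
    out = list(host)
    for i, b in enumerate(bits):
        sign = 1.0 if b else -1.0
        base = i * CHIPS_PER_BIT
        for k in range(CHIPS_PER_BIT):
            out[base + k] += GAIN * sign * pn[k]
    return out


def detect(signal, nbits, pn):
    """Recover each bit from the sign of the correlation with the PN sequence."""
    bits = []
    for i in range(nbits):
        base = i * CHIPS_PER_BIT
        corr = sum(signal[base + k] * pn[k] for k in range(CHIPS_PER_BIT))
        bits.append(1 if corr > 0 else 0)
    return bits


if __name__ == "__main__":
    payload = [1, 0, 0, 1, 1, 0, 1, 0]
    # Stand-in host audio: a 440 Hz tone sampled at 8 kHz.
    host = [math.sin(2 * math.pi * 440 * k / 8000)
            for k in range(len(payload) * CHIPS_PER_BIT)]
    pn = pn_sequence(CHIPS_PER_BIT)
    print(detect(embed(host, payload, pn), len(payload), pn) == payload)
```

Each bit occupies thousands of samples in this sketch, which illustrates why the payload rate of such schemes is so low: the correlation gain needed to pull the low-level sequence out from under the host audio is obtained by spending many chips per bit.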
Other audio watermarking techniques include: frequency division multiplexing, as discussed earlier; the use of the phases of the signal's frequency components to bear the additional payload, since human ears are insensitive to absolute phase values; and embedding the additional payload as echoes of the original signal. Audio watermarking techniques are generally aimed at high security, i.e., a low probability of being detected or removed by a potential attacker, and a low payload rate. Furthermore, a drawback of most audio watermarking algorithms is that they incur a large processing latency. The preferred requirements for extending the NB capacity are just the opposite, namely a high payload rate and a short detection time. Security is considered less of an issue because the PSTN, digital PBX, and VoIP are not generally considered secure communications systems.
It is, therefore, desirable to provide a scheme which can be easily implemented using current technology and which extends the capacity of an NB channel at a higher data rate than that which is achievable using conventional techniques.