Modern communication systems facilitate human-to-human communication at a distance. Compared with other information-bearing signals, direct human/human interaction, such as face-to-face speech, is relatively lag-free, noise-free, even, and stable. Accordingly, typical communications systems that support human/human speech are configured to minimize, or at least reduce, signal distortions such as lag, noise, unevenness, and instability.
One popular approach to communication-at-a-distance is network telephony. Generally, network telephony systems exchange digital signals that represent speech. In some network telephony systems, the digital signals are the result of processing captured analog speech signals. In other network telephony systems, one or more of the digital signals are created by a machine as an original signal.
Network telephony has grown in popularity, in part because of the advent of a standard protocol, the Session Initiation Protocol (SIP). Generally, SIP systems follow defined signaling procedures to establish and maintain communications links for human-interactive media. In some cases, the human-interactive media is provided by software systems that create synthesized speech for delivery to the human user. In other cases, the human-interactive media is speech generated by another human, digitized, and transmitted over the network. Thus, SIP systems support both human/human and human/machine communication.
These systems, however, suffer from numerous drawbacks. For example, humans are generally adept at filtering out signal distortion in ordinary face-to-face human/human communication, in part because visual information such as body language provides additional context. Humans are also much more skilled than machines in extracting meaning from distorted signals in human/human communication at-a-distance.
In human/machine communications, however, signal distortions can cause significant problems in both signal processing and content extraction. For example, unstable connections that suffer from frequent drop-outs can greatly degrade human/machine communication. The initial setup period is particularly vulnerable to signal distortions, so the link between the human interface device and the machine hosting the target application for human/machine communication must be set up efficiently.
In typical prior art systems, Network Address Translation (NAT) is used to protect a private network from unwanted communications. Generally, NAT systems employ a list of private addresses for Endpoints within a private network. Typical NAT systems remap the private addresses to external Internet Protocol (IP) addresses, which are then used to contact devices outside of the private network. The NAT device keeps a list of the translated addresses and creates a secure connection to external devices, which do not know the internal addresses. The external devices communicate via the external IP addresses, through the NAT device, which forwards messages to the appropriate internal addresses. NAT systems sometimes also include port address translation (Network Address Port Translation, or NAPT), which allows multiple devices to share the same external IP address, each mapped to a different port.
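The translation behavior described above can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not a description of any particular NAT product: the class name `NaptTable`, its methods, the starting port 40000, and the example addresses are all hypothetical. The sketch shows only the bookkeeping, in which each private address and port is mapped to a distinct port on one shared external IP address, and inbound traffic is forwarded by looking up that port.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# An endpoint is an (IP address, port) pair.
Endpoint = Tuple[str, int]


@dataclass
class NaptTable:
    """Hypothetical NAPT mapping table sharing one external IP address."""

    external_ip: str
    next_port: int = 40000  # assumed start of the external port range
    outbound: Dict[Endpoint, int] = field(default_factory=dict)
    inbound: Dict[int, Endpoint] = field(default_factory=dict)

    def translate_outbound(self, private: Endpoint) -> Endpoint:
        """Return the external endpoint for a private one, creating a mapping if needed."""
        port = self.outbound.get(private)
        if port is None:
            port = self.next_port
            self.next_port += 1
            self.outbound[private] = port
            self.inbound[port] = private
        return (self.external_ip, port)

    def translate_inbound(self, external_port: int) -> Optional[Endpoint]:
        """Return the private endpoint mapped to an external port, or None if unmapped."""
        return self.inbound.get(external_port)


# Two private devices using the same local port share one external IP address.
table = NaptTable(external_ip="203.0.113.5")
first = table.translate_outbound(("192.168.1.10", 5060))
second = table.translate_outbound(("192.168.1.11", 5060))
```

In this sketch, external devices see only `203.0.113.5` and the assigned ports. A message arriving on a port with no mapping returns `None` from `translate_inbound`, which corresponds to the NAT device declining to forward unsolicited traffic into the private network.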
But typical NAT systems cannot achieve the call setup efficiencies necessary to support the lag-free, noise-free, even, and consistent sound that meaningful, intensive human/machine interactions require. This drawback is even more pronounced in environments using Advanced Interactive Media Applications, which require high-performance processors with full interactive application drivers, including application logic, application states, and speech recognition resources. Without sufficient operational NAT performance, the call setup process for a human/machine communications link can be slow enough to cause unsatisfactory connection times, increased signal distortion, or both, leading to impaired communications.