The present invention relates generally to voice over IP networks and more particularly relates to an apparatus and method of performing remote echo cancellation for the local endpoint of a connection.
Currently, there is a growing trend to converge voice and data networks so that both utilize the same network infrastructure. The currently available systems that combine voice and data have limited applications and scope. An example is Automatic Call Distribution (ACD), which permits service agents in call centers to access customer files in conjunction with incoming telephone calls. ACD centers, however, remain costly and difficult to deploy, requiring custom systems integration in most cases. Another example is the voice logging/auditing system used by emergency call centers (e.g., 911) and financial institutions. Deployment has been limited due to the limited scalability of the system since voice is on one network and data is on another, both tied together by awkward database linkages.
The aim of IP telephony is to provision voice over IP based networks in both the local area network (LAN) and the wide area network (WAN). Currently, voice and data generally flow over separate networks; the goal is to transmit both over a single medium and on a single network.
A block diagram illustrating example separate prior art data and voice networks is shown in FIG. 1. The LAN portion, generally referenced 10, comprises the LAN cabling infrastructure, routers, switches and gateways 12 and one or more network devices connected to the LAN. Examples of typical network devices include servers 14, workstations 16 and printers (not shown). The voice portion, generally referenced 20, has at its core a private branch exchange (PBX) 24 which comprises one or more trunk line interfaces and one or more telephone and/or facsimile extension interfaces. The PBX is connected to the public switched telephone network (PSTN) 22 via one or more trunk lines 28, e.g., analog, T1, E1, T3, ISDN, etc. A plurality of user telephones 26 and one or more facsimile machines 27 are also connected directly to the PBX via phone line extensions 29.
The paradigm currently in widespread use consists of a circuit switched fabric 20 for voice networks and a completely separate LAN infrastructure 10 for data. Most enterprises today use proprietary PBX equipment for voice traffic.
An increasingly common IP telephony paradigm consists of telephone and data tightly coupled on IP packet based, switched, multimedia networks where voice and data share a common transport mechanism. It is expected that this paradigm will spur the development of a wealth of new applications that take advantage of the simultaneous delivery of voice and data over a single unified fabric.
A block diagram illustrating a voice over IP network where voice and data share a common infrastructure is shown in FIG. 2. The IP telephony system, generally referenced 30, comprises a LAN infrastructure represented by an Ethernet switch 32, a router, one or more telephones 36, workstations 34, a gateway 42, a gatekeeper 46, a PBX 33 with a LAN interface port and a Layer 3 switch 38. The key components of an IP telephony system 30 are the modified desktop, gatekeeper and gateway entities. For the desktop, users may have an Ethernet phone 36 that plugs into an Ethernet RJ-45 jack or a handset or headset 35 that plugs into a PC 37.
Today, all LAN based telephony systems need to connect to the PSTN 44. The gateway is the entity that is specifically designed to convert voice from the IP domain to the PSTN domain. The gatekeeper is primarily the IP telephony equivalent of the PBX in the PSTN world.
Typically, the IP telephony traffic is supported by a packet-based infrastructure such as an Ethernet network but a circuit-based infrastructure can be used as well with some provisions (e.g., ATM LAN emulation on ATM networks). Telephony calls traversing the intranet may pass through a Layer 3 switch 38 or a router (not shown) connecting a corporate intranet 40. The Layer 3 switch and the router should support Quality of Service (QoS) features such as IEEE 802.1p and 802.1Q and Resource Reservation Protocol (RSVP).
The International Telecommunication Union Telecommunication Standardization Sector (ITU-T) has issued a number of standards related to telecommunications. The Series H standards deal with audiovisual and multimedia systems and describe standards for systems and terminal equipment for audiovisual services. The H.323 standard is an umbrella standard that covers various audio and video encoding standards. Related standards include H.225.0, which covers media stream packetization and call signaling protocols, and H.245, which covers audio and video capability exchange, management of logical channels and transport of control and indication signals. Details describing these standards can be found in ITU-T Recommendation H.323 (Draft 4 August 1999), ITU-T Recommendation H.225.0 (February 1998) and ITU-T Recommendation H.245 (Jun. 3, 1999).
A block diagram illustrating example prior art H.323 compliant terminal equipment is shown in FIG. 3. The H.323 terminal 50 comprises a video codec 52, audio codec 54, system control 56 and H.225.0 layer 64. The system control comprises H.245 control 58, call control 60 and Registration, Admission and Status (RAS) control 62.
Attached video equipment 66 includes any type of video equipment, such as cameras and monitors including their control and selection, and various video processing equipment. Attached audio equipment 70 includes devices such as those providing voice activation sensing, microphones, loudspeakers, telephone instruments and microphone mixers. Data applications and associated user interfaces 72 include those that use the T.120 real time audiographics conferencing standard or other data services over the data channel. The attached system control and user interface 74 provides the human user interface for system control. The network interface 68 provides the interface to the IP based network.
The video codec 52 functions to encode video signals from the video source (e.g., video camera) for transmission over the network and to decode the received video data for output to a video display. If a terminal incorporates video communications, it must be capable of encoding and decoding video information in accordance with H.261. A terminal may also optionally support encoding and decoding video in accordance with other recommendations such as H.263.
The audio codec 54 functions to encode audio signals from the audio source (e.g., microphone) for transmission over the network and to decode the received audio data for output to a loudspeaker. All H.323 audio terminals must be capable of encoding and decoding speech in accordance with G.711 including both A-law and μ-law encoding. Other types of audio that may be supported include G.722, G.723, G.728 and G.729.
The data channel supports telematic applications such as electronic whiteboards, still image transfer, file exchange, database access, real time audiographics conferencing (T.120), etc. The system control unit 56 provides services as defined in the H.245 and H.225.0 standards. For example, the system control unit provides signaling for proper operation of the H.323 terminal, call control, capability exchange, signaling of commands and indications and messaging to describe the content of logical channels. The H.225.0 Layer 64 is operative to format the transmitted video, audio, data and control streams into messages for output to the network interface. It also functions to retrieve the received video, audio, data and control streams from messages received from the network interface 68.
The gateway functions to convert voice from the IP domain to the PSTN domain. In particular, it converts IP packetized voice to a format that can be accepted by the PSTN. The actual format depends on the type of media and protocol used for connecting to the PSTN (e.g., T1, E1, ISDN BRI, ISDN PRI, analog lines, etc.). The gateway provides the appropriate translation between different video, audio and data transmission formats and between different communications procedures and media.
Note that since the digitization format for voice on the IP packet network is often different from that on the PSTN, the gateway needs to provide this type of conversion, which is known as transcoding. Note also that the gateway functions to pass signaling information such as dial tone, busy tone, etc. Typical connections supported by the gateway include analog, T1, E1, ISDN, frame relay and ATM at OC-3 and higher rates. Additional functions performed by the gateway include call setup and clearing on both the network side and the PSTN side. The gateway may be omitted if communication with the PSTN is not required.
The gatekeeper functions to provide call control services, address translation services, call routing services, call authorization services, billing, bandwidth management and telephony supplementary services like call forwarding and call transfer to terminal endpoints on the network. It is primarily designed to be the IP telephony equivalent of the PBX. Logical endpoints register themselves with the gatekeeper before attempting to bring up a session. The gatekeeper may deny a request to bring up a session or may grant the request at a reduced data rate. This is particularly relevant to video connections that typically consume huge amounts of bandwidth for a high quality connection.
Handling of call control signaling by the gatekeeper is optional. The gatekeeper may choose to complete the call signaling with the H.323 endpoints and process the call signaling itself, or it may direct the endpoints to connect the call signaling channel directly to each other, in which case the gatekeeper avoids handling the H.225.0 call control signals.
Through the use of H.225.0 signaling, the gatekeeper may reject calls from a terminal due to authorization failure. The reasons for rejection may include restricted access to or from particular terminals or gateways, or restricted access during certain time periods.
Bandwidth management entails controlling the number of H.323 terminals that are allowed to simultaneously access the network. Via H.225.0 signaling, the gatekeeper may reject calls from a terminal due to bandwidth limitations. This may occur if the gatekeeper determines that there is insufficient bandwidth available on the network to support the call.
The call management function performed by the gatekeeper includes maintaining a list of currently active H.323 calls. This information is used to indicate that a terminal is busy and to provide information for the bandwidth management function.
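The interaction between bandwidth management and the list of active calls can be illustrated with a minimal sketch. The class name `BandwidthManager`, the kbps units, the `min_rate_kbps` floor and the policy of granting whatever capacity remains are all assumptions made for illustration; H.225.0 message handling is not modeled.

```python
from typing import Dict


class BandwidthManager:
    """Hypothetical gatekeeper admission control over a fixed capacity."""

    def __init__(self, capacity_kbps: int, min_rate_kbps: int = 8) -> None:
        self.capacity_kbps = capacity_kbps
        self.min_rate_kbps = min_rate_kbps  # assumed floor below which a call is rejected
        self.active_calls: Dict[str, int] = {}  # call id -> granted rate

    def available_kbps(self) -> int:
        return self.capacity_kbps - sum(self.active_calls.values())

    def admit(self, call_id: str, requested_kbps: int) -> int:
        """Return the granted rate in kbps, or 0 if the call is rejected."""
        granted = min(requested_kbps, self.available_kbps())
        if granted < self.min_rate_kbps:
            return 0  # insufficient bandwidth: reject the call
        self.active_calls[call_id] = granted  # may be a reduced rate
        return granted

    def release(self, call_id: str) -> None:
        # Called when a call is disconnected; frees its bandwidth.
        self.active_calls.pop(call_id, None)
```

In this sketch a request is either granted in full, granted at a reduced rate, or rejected, mirroring the three outcomes described above.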
The gatekeeper also provides address translation whereby an alias address is translated to a Transport Address. This is performed using a translation table that is updated using Registration messages, for example.
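As a rough illustration of such a translation table, the following sketch keeps an in-memory mapping from alias to Transport Address. The `AliasTable` class and the representation of a Transport Address as an (IP address, port) pair are assumptions for illustration only.

```python
from typing import Dict, Iterable, Optional, Tuple

# Assumed representation of a Transport Address: (IP address, port).
TransportAddress = Tuple[str, int]


class AliasTable:
    """Hypothetical gatekeeper translation table keyed by alias address."""

    def __init__(self) -> None:
        self._table: Dict[str, TransportAddress] = {}

    def register(self, aliases: Iterable[str], address: TransportAddress) -> None:
        # A Registration message updates every alias the endpoint announces.
        for alias in aliases:
            self._table[alias] = address

    def translate(self, alias: str) -> Optional[TransportAddress]:
        # Returns None when the alias is not registered in this zone.
        return self._table.get(alias)
```

Several aliases may map to the same Transport Address, while each alias maps to exactly one address.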
The H.225.0 standard dictates the usage of the Real-time Transport Protocol (RTP), which is defined by the IETF in RFC 1889, for conveying data between the call endpoints and for monitoring network congestion. The RTP protocol defines the RTP packet structure that includes two parts: the RTP packet header part and the RTP packet payload part. The RTP packet header includes several fields. Among those fields are the payload type identification field, the sequence numbering field and the time stamping field. Typically, applications encapsulate RTP in a UDP packet. UDP/IP is an unreliable transport mechanism and therefore there is no guarantee that the RTP packet will reach its destination. RTP may, however, be used with other suitable underlying network or transport protocols.
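The fixed part of the RTP header occupies 12 bytes. The sketch below packs and parses that fixed header using Python's `struct` module; the `RtpHeader` type and the function names are illustrative, and CSRC entries and header extensions are deliberately left out.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


# Two flag bytes, 16-bit sequence number, 32-bit timestamp, 32-bit SSRC,
# all in network byte order.
_FIXED = struct.Struct("!BBHII")


def pack_rtp_header(h: RtpHeader) -> bytes:
    b0 = (h.version << 6) | (int(h.padding) << 5) | (int(h.extension) << 4) | (h.csrc_count & 0x0F)
    b1 = (int(h.marker) << 7) | (h.payload_type & 0x7F)
    return _FIXED.pack(b0, b1, h.sequence_number & 0xFFFF,
                       h.timestamp & 0xFFFFFFFF, h.ssrc & 0xFFFFFFFF)


def parse_rtp_header(data: bytes) -> RtpHeader:
    if len(data) < _FIXED.size:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    b0, b1, seq, ts, ssrc = _FIXED.unpack_from(data)
    return RtpHeader(b0 >> 6, bool(b0 & 0x20), bool(b0 & 0x10), b0 & 0x0F,
                     bool(b1 & 0x80), b1 & 0x7F, seq, ts, ssrc)
```

The payload type, sequence number and timestamp fields named above are the ones the receiver uses to identify the codec and to restore packet order and timing.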
RTP does not itself provide any mechanism to ensure timely delivery or other QoS guarantees, but relies on lower layer services to do so. It also does not guarantee delivery, nor does it assume that the underlying network is reliable and delivers packets in sequence. RTP includes sequence numbers and timestamps in the packet to allow the receiver to reconstruct the sender's packet sequence and timing.
RTP is intended to be flexible so as to provide the information required by a particular application. Unlike conventional protocols in which additional functions may be accommodated by making the protocol more general or by adding an option mechanism that requires parsing, RTP can be tailored through modifications and/or additions to the headers.
The RTP Control Protocol (RTCP) functions to periodically transmit control packets to all participants in a session. The primary function of RTCP is to provide feedback on the quality of the data distribution that is useful for monitoring network congestion. The RTCP protocol is designed to monitor the quality of service and to convey information about the participants in an on-going session. RTCP also carries a transport level identifier for an RTP source called the canonical name or CNAME. Receivers require the CNAME to associate multiple data streams from a given participant in a set of related RTP sessions. The RTCP protocol can also be used to convey session control information such as participant identification. Each RTCP packet begins with a fixed header followed by structured elements of variable length. Note that the signaling/control information carried in the RTCP packets is transmitted using the TCP/IP reliable protocol.
Also under the H.323 protocol umbrella are a number of standards for voice codecs including for example, G.711, G.729, G.729.1 and G.723.1.
Call signaling encompasses the messages and procedures used to establish a call, request changes in bandwidth of the call, get status of the endpoints in the call and disconnect the call. Call signaling uses messages defined in the H.225.0 standard. In particular, the RAS signaling function uses H.225.0 messages to perform registration, admissions, bandwidth changes, status and disengage procedures between endpoints and Gatekeepers. The RAS Signaling Channel is independent from the Call Signaling Channel and the H.245 Control Channel.
Each H.323 entity has at least one network address that uniquely identifies the H.323 entity on the network. For each network address, each H.323 entity may have several TSAP identifiers that enable the multiplexing of several channels sharing the same network address. Endpoints have one well-known TSAP identifier known as the Call Signaling Channel TSAP Identifier. In addition, Gatekeepers also have one well-known TSAP identifier defined as the RAS Channel TSAP Identifier, and one well-known multicast address defined as the Discovery Multicast Address. Endpoints and H.323 entities use dynamic TSAP Identifiers for the H.245 Control Channel, Audio Channels, Video Channels, and Data Channels while the Gatekeeper uses a dynamic TSAP Identifier for Call Signaling Channels.
Further, an endpoint may have one or more alias addresses associated with it. An alias address represents the endpoint and provides an alternate method of addressing the endpoint. It is important to note that an endpoint may have more than one alias address that translates to the same TSAP. The alias may comprise, for example, private telephone numbers, E.164 numbers, any alphanumeric string that may represent a name, e-mail address, etc. In addition, the alias may comprise a MAC address, IP address, ATM address, access token, DNS address, TSAP as IP address concatenated with a port number or name alias. Note that alias addresses are unique within a zone and that gatekeepers do not have alias addresses.
When there is a Gatekeeper in the network, the calling endpoint addresses the called endpoint by its Call Signaling Channel Transport Address or by its alias address. The Gatekeeper translates the latter into a Call Signaling Channel Transport Address.
An endpoint joins a zone via the registration process whereby it informs the Gatekeeper of its Transport Addresses and one or more associated alias addresses. Note that registration must take place before any calls are attempted. When endpoints are powered up, they look on the network for the Gatekeeper and, once it is found, they register their TSAP and one or more aliases therewith.
In LAN Telephony applications, the voice samples generated are packed within RTP packets that are then encapsulated within UDP/IP packets. The UDP packets that travel over an IP network may, however, be delayed, dropped or arrive out of order from their original transmission sequence depending on the degree of network congestion. Therefore, the rate at which the packets arrive at the receive side is not constant.
In order to combat the variable delay problems, many devices implement a jitter buffer on the receive side. If packets are only delayed within the network, arriving at the receiver before the jitter buffer underflows, the receive side will hear the sound as it was originally transmitted by the local endpoint. If, however, packets are dropped or packets are delayed too much and the jitter buffer underflows (i.e. becomes empty), the receiving device either (1) replays the last packet received or (2) injects a silence.
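The two underflow behaviors can be sketched as follows. The `JitterBuffer` class, its method names and the returned labels are hypothetical; the sketch assumes packets are pushed in playout order and does not model reordering or playout timing.

```python
from collections import deque
from typing import Deque, List, Optional, Sequence, Tuple


class JitterBuffer:
    """Hypothetical receive-side buffer showing the two underflow policies."""

    def __init__(self, frame_size: int, on_underflow: str = "replay") -> None:
        if on_underflow not in ("replay", "silence"):
            raise ValueError("on_underflow must be 'replay' or 'silence'")
        self._queue: Deque[List[int]] = deque()
        self._last: Optional[List[int]] = None
        self._frame_size = frame_size
        self._on_underflow = on_underflow

    def push(self, samples: Sequence[int]) -> None:
        # Store the payload of a packet that arrived from the network.
        self._queue.append(list(samples))

    def pop(self) -> Tuple[List[int], str]:
        """Return the samples to play now and how they were obtained."""
        if self._queue:
            self._last = self._queue.popleft()
            return self._last, "played"
        # Underflow: the buffer is empty when the next frame is due.
        if self._on_underflow == "replay" and self._last is not None:
            return self._last, "replayed"
        return [0] * self._frame_size, "silence"
```

The label returned by `pop` is the piece of information that distinguishes what was actually played from what was originally transmitted.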
Thus, in the event packets are dropped or are delayed excessively causing jitter buffer underflow, the sound that is played on the receive side is not the original sound that was transmitted.
As in most voice communication devices, e.g., telephone, etc., a portion of the voice that is played on the receive side is returned to the transmitting side as an undesirable echo by the transmitter portion of the device. There are several sources that cause this undesirable phenomenon. The first source is the acoustic echo made up of sound waves produced by the loudspeaker that are reflected by the room walls and other objects in the room towards the microphone that records them. Another source of echo is the magnetic flux effects of the hybrid circuit in the telephone set and at the Central Office (CO). A 4 to 2 line hybrid is located in the telephone set to merge both transmit and receive directions onto a single copper pair wire. A corresponding 2 to 4 line hybrid is located at the CO to convert the single line into separate transmit and receive circuits. The magnetic flux of the receive circuitry passes through the transmit circuit coils thus causing the transmit circuit to record what is played on the receive circuit.
Some echo or feedback is desirable, however, such as when speaking on the telephone and the speaker hears her/his own voice through the handset. In this case, a small portion of the voice from the microphone is intentionally fed back to the speaker element. This intentional echo is injected locally from the microphone towards the speaker and is never sent to the remote side.
In communication systems adapted to transfer voice, the quality of the voice is sensitive, among other things, to the round trip delay. If the round trip delay is less than 300 ms, the returned echo will not be bothersome to users. If, however, the round trip delay is greater than approximately 300 ms, the returned echo becomes noticeable to most users. In the IP telephony world there exist several sources that contribute to the round trip delay. First, each endpoint collects several samples until it fills an RTP packet, thus delaying the first samples. The packet is then encapsulated within a UDP/IP/Ethernet packet and is sent over the network. The packet traverses the network, passing through one or more routers and switches, where each hop adds to the overall delay. Finally, it arrives at the remote endpoint where it is delayed in a jitter buffer until it is played. At the remote endpoint, the played sample returns as an echo with the voice that is now recorded. The round trip delay is twice the time it took from the time the sample is recorded until it is played at the remote end.
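The delay budget described above can be worked through numerically. The sketch below sums the three contributions named in the text and doubles the result; the function names and the example figures are illustrative, and symmetric paths in both directions are assumed.

```python
from typing import Iterable

ECHO_THRESHOLD_MS = 300.0  # approximate threshold quoted in the text


def round_trip_delay_ms(packetization_ms: float,
                        hop_delays_ms: Iterable[float],
                        jitter_buffer_ms: float) -> float:
    """Twice the one-way delay from recording to playout (symmetric paths assumed)."""
    one_way = packetization_ms + sum(hop_delays_ms) + jitter_buffer_ms
    return 2.0 * one_way


def echo_is_noticeable(rtt_ms: float) -> bool:
    return rtt_ms > ECHO_THRESHOLD_MS
```

For example, 20 ms of packetization, three hops of 5, 10 and 15 ms and a 60 ms jitter buffer give a one-way delay of 110 ms and a round trip of 220 ms, below the threshold. Doubling the jitter buffer to 120 ms raises the round trip to 340 ms, above it.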
Each endpoint must, therefore, be adapted to remove this echo if the round trip delay is more than 300 ms. The echo is always removed locally whereby each end of a connection is adapted to subtract the echo from the signal it transmits to the other side. In the IP telephony world the echo must be removed locally, since the echo is generated from the sound that is played which may be different from the sound that was originally transmitted. Thus, each endpoint must incorporate the necessary means for removing the echo.
Typically, an endpoint incorporates one or more specialized powerful processors such as digital signal processors (DSPs) to perform the echo cancellation. A disadvantage is that these processors and their associated circuitry are costly, thus increasing the resultant cost and design complexity of any device incorporating them.
The present invention provides an apparatus for and a method of remote echo cancellation in packet based telephony systems. Using the present invention, one or both endpoints in a connection do not need to perform the complex and processor intensive task of echo cancellation. Utilizing the present invention, the remote end of a connection is adapted to perform echo cancellation algorithms for both itself and the endpoint at the other end of the connection. Alternatively, a third party device serving as a transit point for the RTP packet stream can be adapted to perform the method of the present invention for one or both endpoints.
The remote endpoint (or third party device) is provided knowledge of the actual audio played on the other side (or local side) and a means for synchronizing this audio stream to the audio stream that was concurrently recorded by the local endpoint. The remote endpoint must know what was played at the local endpoint in order to accurately cancel the echo from the audio samples generated and sent by the local endpoint. The remote endpoint must estimate the echo function on the local endpoint.
To perform echo cancellation, the remote endpoint needs to know, for each data sample recorded by the local endpoint, what data sample from the remote endpoint the local endpoint played at that moment in time. In addition, the remote endpoint needs to know the several data samples that preceded the recorded data sample. The remote endpoint (or third party) is provided knowledge of the audio played on the local end of the connection via information transmitted in the header and header extension portions of the RTP packets and via the knowledge of the number of samples in the payload part of the RTP packet. There are two methods by which the local endpoint can notify the remote endpoint about which remote endpoint samples were played when the samples in the data packet were recorded: the first method is by using timestamps and the second method is by using RTP packet sequence numbers and offset pointers into the RTP packets.
In the timestamp method, the other endpoint (i.e. the local endpoint) is adapted to include the timestamp of the packet of audio that is played, with the packet of data samples sent to the remote endpoint. Thus, two timestamps are sent in the RTP packet including (1) a first timestamp of the data samples generated by the local endpoint (this timestamp value is taken when the first sample in the packet is taken) and (2) a second timestamp of the packet received from the remote endpoint and played at a point in time when the first sample of the local endpoint packet is generated.
The local endpoint is operative to track the timestamp of the data samples received encapsulated in RTP packets sent from the remote endpoint. These data samples are subsequently played by the local endpoint through its associated speaker. The data samples generated by the local endpoint are timestamped and placed in RTP packets. In addition, the timestamp of the data samples played by the local endpoint at that moment in time is also placed in the extension portion of the header of the RTP packet sent to the remote endpoint.
If the last packet received was replayed, an indication is placed in the header extension of the packet that comprises the timestamp of the most recently received RTP packet. If a silence is played, a zero is placed in the header extension. The completed RTP packet is then sent to the remote endpoint.
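The choice of value placed in the header extension on the local endpoint can be sketched as below. The layout follows the general RTP header extension format (a 16-bit profile-defined identifier, a 16-bit length counted in 32-bit words, then the extension data); the identifier `HYPOTHETICAL_PROFILE_ID`, the function names and the event labels are assumptions, since the text does not specify an encoding.

```python
import struct

SILENCE = 0  # per the text, a zero in the extension means silence was played
HYPOTHETICAL_PROFILE_ID = 0xEC00  # assumption: not defined by the source text


def played_timestamp_for(event: str, current_ts: int, last_received_ts: int) -> int:
    """Pick the second timestamp to report for the frame just played."""
    if event == "silence":
        return SILENCE
    if event == "replayed":
        # Report the timestamp of the most recently received packet again.
        return last_received_ts
    if event == "played":
        return current_ts
    raise ValueError("unknown playout event: " + event)


def build_extension(played_timestamp: int) -> bytes:
    # Identifier, length of 1 word, then the 32-bit played timestamp.
    return struct.pack("!HHI", HYPOTHETICAL_PROFILE_ID, 1,
                       played_timestamp & 0xFFFFFFFF)
```

One consequence of reserving zero for silence is that a genuine timestamp of zero cannot be distinguished from it; how that case is avoided is not addressed in this sketch.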
The timestamp from the header extension portion of the RTP packets received from the local endpoint is extracted. A timestamp equal to zero indicates that a silence was played at the local endpoint. If the timestamp extracted is equal to the previous timestamp sent by the local endpoint, then this indicates that the local endpoint replayed the last received packet.
Otherwise, the timestamp extracted from the header extension is the timestamp of the packet that was played on the local endpoint at a point in time corresponding to the timestamp of the data samples sent in the packet. Assuming the remote endpoint has an estimate of the echo function on the local endpoint, the remote endpoint performs echo cancellation using its knowledge of the data samples played on the local endpoint. The remote endpoint is adapted to maintain a copy of the most recent packets sent to the local endpoint. Since it maintains a copy of the packets, only the timestamp need be sent from the local endpoint to uniquely identify a particular packet.
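The remote-side processing can be sketched in two steps: interpreting the extracted timestamp against the stored copies of sent packets, then subtracting the estimated echo. The function names are hypothetical, and modeling the echo function as a short FIR filter with known taps is an assumption made for illustration; the text does not prescribe how the echo function is represented or estimated.

```python
from typing import Dict, List, Sequence, Tuple


def reconstruct_played(ext_ts: int, prev_ext_ts: int,
                       sent: Dict[int, Sequence[float]],
                       frame_size: int) -> Tuple[List[float], str]:
    """Recover what the local endpoint played from the extension timestamp."""
    if ext_ts == 0:
        return [0.0] * frame_size, "silence"
    kind = "replayed" if ext_ts == prev_ext_ts else "played"
    # Raises KeyError if the packet copy has already been discarded.
    return list(sent[ext_ts]), kind


def cancel_echo(recorded: Sequence[float], played: Sequence[float],
                history: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Subtract echo[n] = sum_k taps[k] * x[n - k] from each recorded sample."""
    pad = len(taps) - 1
    prefix = list(history[-pad:]) if pad > 0 else []
    prefix = [0.0] * (pad - len(prefix)) + prefix  # zero-fill missing history
    x = prefix + list(played)
    out = []
    for n in range(len(recorded)):
        echo = sum(taps[k] * x[n + pad - k] for k in range(len(taps)))
        out.append(recorded[n] - echo)
    return out
```

The `history` argument carries the samples played just before the current frame, which is why the remote endpoint needs the several data samples preceding each recorded sample.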
The sequence method is similar to the timestamp method with the difference being that endpoint A places the sequence number and the offset within the packet that was received from endpoint B and played at the time when the first sample of the RTP packet being built is taken. This is in place of sending a timestamp.
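Under the sequence method, the receiving side resolves a (sequence number, offset) pair against its stored copies of sent packets. The sketch below is a minimal illustration; the function name and the dictionary of stored payloads are hypothetical, and it assumes playout continued without silence or replay across the requested span.

```python
from typing import Dict, List, Sequence


def samples_played(sent_by_seq: Dict[int, Sequence[int]],
                   seq: int, offset: int, count: int) -> List[int]:
    """Collect `count` samples starting at `offset` within packet `seq`."""
    out: List[int] = []
    seq &= 0xFFFF
    # Visit each stored packet at most once so the loop always terminates.
    for _ in range(len(sent_by_seq)):
        if len(out) >= count:
            break
        payload = sent_by_seq.get(seq)
        if payload is None:
            break  # the copy of this packet is no longer available
        out.extend(payload[offset:offset + count - len(out)])
        # Continue at the start of the next packet; sequence numbers are 16 bits.
        seq, offset = (seq + 1) & 0xFFFF, 0
    return out
```

Because the offset points into the packet, the reported position has sample granularity, and a span may cross a packet boundary.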
There is therefore provided in accordance with the present invention a method of performing echo cancellation on a remote device in a packet telephony system, the system supporting a connection between a first endpoint and a second endpoint, the method comprising the steps of tracking a second timestamp of data samples originating from the second endpoint that are played by the first endpoint, generating data samples on the first endpoint, sending to the remote device packets containing data samples generated by the first endpoint, a first timestamp corresponding thereto and the second timestamp of data samples from the second endpoint played by the first endpoint at that moment in time, placing an indication in the packet of data samples sent to the remote device, the indication operative to specify whether a packet, several packets, several sequential samples from the same packet or several sequential samples from different packets received by the first endpoint were replayed or that a silence was played, tracking the number of data samples in the packets received by the remote device and reconstructing on the remote device the signal played on the first endpoint using the first timestamp, the second timestamp, the number of samples in the packet, and the indication information and performing echo cancellation therewith.
There is also provided in accordance with the present invention an apparatus for performing echo cancellation on a remote device in a packet telephony system, the system supporting a connection between a first endpoint and a second endpoint comprising means for tracking a second timestamp of data samples originating from the second endpoint that are played by the first endpoint, means for generating data samples on the first endpoint, means for sending to the remote device packets containing data samples generated by the first endpoint, a first timestamp corresponding thereto and the second timestamp of data samples from the second endpoint played by the first endpoint at that moment in time, means for placing an indication in the packet of data samples sent to the remote device, the indication operative to specify whether a packet, several packets, several sequential samples from the same packet or several sequential samples from different packets received by the first endpoint were replayed or that a silence was played, means for tracking the number of data samples in the packets received by the remote device and means for reconstructing on the remote device the signal played on the first endpoint using the first timestamp, the second timestamp, the number of samples in the packet, and the indication information and performing echo cancellation therewith.