The invention relates to transmitting voice over the internet and, in particular, to reducing the bandwidth required to transmit voice over the internet.
Methods and apparatus for transmitting voice over internet protocol (VOIP) are known. VOIP services are offered by numerous companies and standards for internet telephony have been promulgated by the ITU-T. The ITU-T umbrella standard for VOIP is H.323 rev 2 (1998), “Packet based multimedia communications systems”, the disclosure of which is incorporated herein by reference. An alternative umbrella standard referred to as “Session Initiation Protocol (SIP)” has recently been promulgated for internet telephony by the Internet Engineering Task Force (IETF).
In an internet telephony session between a first and second party, an internet connection is provided between communication equipment at the first party's premises and communication equipment at the second party's premises via their respective internet service providers. During the telephony session each party's communication equipment generates a stream of samples of the party's speech which is parsed into a sequence of groups referred to as “audio frames”. Each audio frame contains a predetermined desired number of samples and corresponds to a desired sampling period. The communication equipment encodes the samples in each audio frame in a constellation of symbols using an appropriate audio encoding scheme such as PCM, ADPCM or LPC.
Each encoded audio frame is encapsulated in a “real time transport packet” in accordance with a real time transport protocol. Under the ITU-T H.323 internet telephony standard, the audio frame is encapsulated in accordance with a real time transport protocol referred to by the acronym “RTP”. RTP is defined in Schulzrinne, et al., “RTP: A Transport Protocol for Real-Time Applications”, RFC 1889, Internet Engineering Task Force, January 1996, the disclosure of which is incorporated herein by reference.
In accordance with RFC 1889, the real time transport packet, hereinafter referred to as an “RTP packet”, that encapsulates the audio frame comprises a header having a sequence number. The sequence number corresponds to the temporal order of the audio frame in the RTP packet relative to other audio frames in the sequence of audio frames generated by the communication equipment. Each RTP packet is in turn packaged in a data packet with a suitable data packet header according to an internet transport protocol. Typically, the internet transport protocol for “RTP transmission” is UDP. The data packets are transmitted in a stream of data packets over the internet to the other party.
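By way of a non-limiting illustration, the encapsulation described above might be sketched as follows. The field layout follows the fixed 12-byte RTP header of RFC 1889; the function names, the default payload type value and the example field values are hypothetical and are not taken from the disclosure.

```python
import struct

RTP_VERSION = 2  # version number carried in the top two bits of the header


def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
                     payload_type: int = 0) -> bytes:
    """Prefix an encoded audio frame with a fixed 12-byte RTP header.

    payload_type defaults to an arbitrary placeholder value (assumption).
    """
    first = RTP_VERSION << 6          # no padding, no extension, zero CSRC entries
    second = payload_type & 0x7F      # marker bit left clear
    header = struct.pack("!BBHII", first, second,
                         seq & 0xFFFF,            # 16-bit sequence number
                         timestamp & 0xFFFFFFFF,  # 32-bit timestamp
                         ssrc & 0xFFFFFFFF)       # 32-bit source identifier
    return header + payload


def rtp_sequence_number(packet: bytes) -> int:
    """Read the 16-bit sequence number back out of an RTP packet."""
    return struct.unpack("!H", packet[2:4])[0]


packet = build_rtp_packet(b"\x01\x02", seq=7, timestamp=160, ssrc=0x1234)
```

In such a sketch the resulting bytes would be handed to a UDP socket as the datagram payload; the UDP header itself is added by the network stack.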
When the other party receives the stream of data packets, the other party's communication equipment strips each data packet in the stream and its enclosed RTP packet of their respective headers to “unload” the audio frame “payload” in the RTP packet. The communication equipment then concatenates the unloaded audio frames sequentially according to the sequence numbers of their respective RTP packets. The concatenated audio frames are decoded and converted to analogue audio signals to reproduce the speech of the party transmitting the data packets.
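The receiver-side ordering step described above can be illustrated by the following minimal sketch. It assumes the headers have already been stripped, so that each arrival is a hypothetical `(sequence_number, frame_bytes)` pair, and it ignores wrap-around of the 16-bit sequence number for simplicity.

```python
def reassemble(arrivals):
    """Concatenate unloaded audio frames in sequence-number order.

    arrivals: iterable of (sequence_number, frame_bytes) pairs in arrival order.
    Duplicate sequence numbers (e.g. redundant copies) are kept only once.
    """
    unique = {}
    for seq, frame in arrivals:
        unique.setdefault(seq, frame)  # keep the first copy of each frame
    return b"".join(unique[seq] for seq in sorted(unique))


# Out-of-order arrival with one duplicated frame.
stream = reassemble([(2, b"c"), (0, b"a"), (1, b"b"), (1, b"b")])
```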
Transmission of data packets using UDP can be unreliable and data packets sent via UDP can disappear without a trace and never reach their intended destinations. A data packet can be lost for example if it passes through a network node that is overloaded and “decides” to dump excess traffic. The rate at which data packets are lost generally increases as a network becomes more congested.
To improve reliability and quality of internet telephony using RTP “on top of” UDP and reduce effects of data packet loss on internet telephony, redundancy is sometimes implemented in audio frame transmission between parties to an internet telephony session. With redundancy, the same audio frame to be transmitted from one to the other of the parties participating in the internet telephony session is transmitted more than once to increase the likelihood that it reaches its destination. A redundancy protocol has been promulgated in C. Perkins et al., “RTP Payload for Redundant Audio Data”, RFC 2198, Internet Engineering Task Force, September 1997, the disclosure of which is incorporated herein by reference.
While redundancy reduces vulnerability of data transmission to packet loss and improves reliability of data transmission, transmission of data with redundancy generally requires a bit-rate greater than a bit-rate required to transmit the data without redundancy. Redundant data transmission therefore utilizes a greater portion of channel capacity than non-redundant transmission. As a result, while redundancy provides some protection against data packet loss, redundancy tends to increase network congestion, which can in turn exacerbate the packet loss problem redundancy is intended to alleviate. Frugal use of redundancy is therefore generally advisable.
U.S. patent application Ser. No. 09/241,857, entitled “Method and Apparatus for Transmitting Packets”, the disclosure of which is incorporated in its entirety herein by reference, describes a method of implementing redundancy in audio and video data packet transmission over the internet. The application discloses, inter alia, controlling use of redundancy in transmitting information over an internet channel responsive to transmission conditions over the channel so as to reduce channel capacity required to support data transmission with redundancy.
An aspect of some embodiments of the present invention relates to providing a method for transmitting voice over the internet with redundancy that can generally be implemented at average bit-rates that are lower than average bit-rates required by prior art methods of transmitting voice over the internet with redundancy. As a result, a VOIP redundancy method, in accordance with an embodiment of the present invention, generally uses less channel capacity than prior art VOIP redundancy methods.
In accordance with an embodiment of the present invention, the speech of a person participating in an internet telephony session with another person or persons is monitored to determine when the person is speaking and when the person is silent. In addition, for periods, hereinafter referred to as “voice periods”, during which the person is speaking, the person's speech is optionally analyzed to determine which portions of the voice periods are stationary.
A stationary portion of a voice period is a time period, having a duration equal to the duration of at least two audio frames into which a person's speech is parsed for transmission, during which a power spectrum of the person's speech is substantially constant. Stationary portions of a voice period are referred to as stationary intervals. Except for a first audio frame that falls entirely within a stationary interval, audio frames that fall entirely within a stationary interval are referred to as stationary audio frames. Audio frames that are not entirely within a stationary interval, and audio frames that are the first audio frame completely within a stationary interval, are referred to as non-stationary audio frames. By definition, stationary audio frames from a same stationary interval have a same spectrum. As a result any stationary audio frame in a stationary interval can be reconstructed from a previous audio frame in the stationary interval.
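The labelling rule defined above can be illustrated by the following minimal sketch. It assumes that one power spectrum per audio frame has already been computed; the relative Euclidean distance measure and the tolerance `tol` used to decide that two spectra are "substantially" the same are assumptions made for illustration and are not specified in the disclosure.

```python
import math


def _relative_distance(spec, ref):
    """Euclidean distance between two spectra, relative to the reference norm."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(spec, ref)))
    norm = math.sqrt(sum(b * b for b in ref))
    return diff / norm if norm > 0 else diff


def label_stationary(spectra, tol=0.1):
    """Return one boolean per frame: True if the frame is a stationary audio frame.

    The first frame of each run of similar spectra is labelled non-stationary
    (False); later frames of the same run are labelled stationary (True).
    """
    labels = []
    anchor = None  # spectrum of the first frame of the current run
    for spec in spectra:
        if anchor is not None and _relative_distance(spec, anchor) <= tol:
            labels.append(True)
        else:
            labels.append(False)
            anchor = spec
    return labels


labels = label_stationary([[1, 0], [1, 0], [1, 0.01], [0, 1], [0, 1]])
```

Comparing each frame with the first frame of its run, rather than with its immediate predecessor, is a design choice of this sketch that keeps slow spectral drift from extending a stationary interval indefinitely.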
In some embodiments of the present invention, as in many prior art VOIP systems, both silent periods and voice periods of the person's speech are encoded in audio frames and transmitted in data packets to the person or persons with whom the person is speaking. However, in accordance with an embodiment of the present invention, if redundancy is required during the telephony session to assure quality of voice transmission, redundancy is implemented only for voice periods of the person's speech and optionally only for non-stationary audio frames of the voice periods. Redundancy is not implemented for the silent periods of the person's conversation and optionally not implemented for stationary audio frames of voice periods.
If a stationary audio frame of a VOIP transmission gets lost, communication equipment receiving the VOIP transmission reconstructs the lost audio frame from a spectrum of an audio frame in a same stationary interval as the lost audio frame. Optionally, the audio frame is an audio frame preceding the lost stationary audio frame. Optionally, the lost audio frame is reconstructed from an audio frame immediately preceding the lost stationary audio frame. (It is noted an audio frame immediately preceding a stationary audio frame is either a stationary audio frame of a stationary interval in which the lost audio frame is located or a first audio frame of the stationary interval. In either case, the preceding audio frame has a same spectrum as the lost audio frame, and the lost audio frame can therefore be reconstructed therefrom.)
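A minimal sketch of receiver-side recovery along the lines described above is given below. The disclosure does not specify the reconstruction algorithm; this sketch simply reuses the data of the immediately preceding audio frame, which is an assumed simplification, and the function and parameter names are hypothetical.

```python
def conceal_losses(received, first_seq, last_seq):
    """Fill gaps in a frame sequence from the immediately preceding frame.

    received: dict mapping sequence number -> frame bytes for frames that arrived.
    Returns a list with one entry per sequence number in [first_seq, last_seq];
    an entry is None only if no earlier frame is available to copy from.
    """
    output = []
    previous = None
    for seq in range(first_seq, last_seq + 1):
        frame = received.get(seq)
        if frame is None:
            frame = previous  # stand-in for the lost frame
        output.append(frame)
        if frame is not None:
            previous = frame
    return output


# Frame 2 was lost in transit and is rebuilt from frame 1.
frames = conceal_losses({0: b"a", 1: b"b", 3: b"d"}, 0, 3)
```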
It is noted that some VOIP protocols do not encode silent periods of a person's speech and instead transmit predetermined “comfort noise” during the silent periods. In some embodiments of the present invention for which comfort noise is transmitted during silent periods of a person's speech, if redundancy is required to provide quality of transmission, redundancy is implemented only for non-stationary audio frames of the person's voice periods.
Redundancy coding only for voice periods of a person's speech and optionally only for non-stationary audio frames of the voice periods, in accordance with embodiments of the present invention, is hereinafter referred to as “voice-selective redundancy”. Prior art redundancy coding, which is implemented for both voice and silent periods of a person's speech, is hereinafter referred to as “non-voice-selective redundancy”.
As a result of implementing voice-selective redundancy, in accordance with an embodiment of the present invention, substantially less data has to be transmitted to support redundancy for VOIP and redundancy can be provided at bit-rates that are substantially less than bit-rates required by prior art VOIP redundancy methods. Transmitting voice using voice-selective redundancy, in accordance with an embodiment of the present invention, therefore uses less channel capacity and results in less channel congestion than transmitting voice using prior art non-selective redundancy.
For example, assume that transmitting a person's speech over the internet with non-selective redundancy according to prior art requires a bit-rate that is 40% greater than that required for transmitting the person's speech without redundancy. It is noted that a person's speech is generally punctuated by relatively long and frequent periods of silence when conversing with another person, and in particular when conversing with another person over the telephone. On average, a person conversing with another person over the telephone is substantially silent about 60% of the time and voice periods occupy only about 40% of the person's speech. Furthermore, stationary intervals of voice periods may occupy on average as much as 50% of a person's voice periods. Therefore, assuming that 60% of the person's speech consists of periods of silence, implementing redundancy only for voice periods of the person's speech, in accordance with an embodiment of the present invention, requires an average bit-rate that is only about 16% greater than the non-redundant bit-rate (40% of the 40% redundancy overhead). If in addition, in accordance with an embodiment of the present invention, redundancy is implemented only for non-stationary audio frames, an average bit-rate that is only about 8% greater than the non-redundant bit-rate is required to support redundancy. The last result assumes that non-stationary audio frames account on average for about 50% of a voice period.
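The arithmetic of the example above can be checked with the following short sketch. It assumes, as the example implicitly does, that the redundancy overhead scales linearly with the fraction of audio frames that are transmitted redundantly; the function name is hypothetical.

```python
def redundancy_overhead(full_overhead, voice_fraction, non_stationary_fraction=1.0):
    """Average extra bit-rate, as a fraction of the non-redundant bit-rate.

    full_overhead: overhead when every frame is sent redundantly (0.40 in the example).
    voice_fraction: fraction of time occupied by voice periods (0.40 in the example).
    non_stationary_fraction: fraction of a voice period that is non-stationary.
    """
    return full_overhead * voice_fraction * non_stationary_fraction


voice_only = redundancy_overhead(0.40, 0.40)               # about 0.16
non_stationary_only = redundancy_overhead(0.40, 0.40, 0.50)  # about 0.08
```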
It is noted that voice-selective redundancy, in accordance with an embodiment of the present invention, provides substantially a same quality of voice transmission as non-voice-selective redundancy. Loss of “silent” audio frames that encode silent periods of a person's speech does not substantially affect perceived quality of reception. In addition, as noted above, lost stationary audio frames which encode portions of voice periods of a person's speech are reconstructed from a spectrum of an audio frame optionally temporally adjacent to the lost audio frame. As a result, even though voice-selective redundancy, in accordance with an embodiment of the present invention, does not protect against loss of data packets carrying silent audio frames, it provides a substantially same quality of speech transmission as non-voice-selective redundancy.
An aspect of some embodiments of the present invention relates to providing communication equipment for implementing VOIP with voice-selective redundancy, in accordance with an embodiment of the present invention.
Communication equipment for VOIP, in accordance with an embodiment of the present invention, comprises a controller and a voice monitor. The voice monitor monitors speech of a person using the communication equipment to identify silent periods and voice periods of the person's speech. The voice periods are then analyzed to distinguish between stationary and non-stationary intervals. The monitor generates signals responsive to the person's speech that it transmits to the controller to indicate when the silent periods, stationary intervals and non-stationary intervals occur. If redundancy is required to provide quality VOIP, the controller controls the communication equipment to implement redundancy during voice periods of the person's speech and optionally, to implement redundancy only during non-stationary intervals of the voice periods.
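The controller's decision described above might be sketched as follows. The names `FrameKind` and `copies_to_send`, the `protect_stationary` flag and the choice of exactly two copies for a redundantly transmitted frame are assumptions made for illustration only.

```python
from enum import Enum


class FrameKind(Enum):
    """Classification reported by the voice monitor for one audio frame."""
    SILENT = "silent"
    STATIONARY = "stationary"
    NON_STATIONARY = "non_stationary"


def copies_to_send(kind, redundancy_required, protect_stationary=False):
    """Number of times the controller has a frame transmitted.

    protect_stationary=True applies redundancy to all voice frames;
    False (the optional variant) applies it only to non-stationary frames.
    """
    if not redundancy_required or kind is FrameKind.SILENT:
        return 1  # silent frames are never sent redundantly
    if kind is FrameKind.STATIONARY and not protect_stationary:
        return 1  # recoverable at the receiver from an earlier frame
    return 2      # assumed: a single redundant copy


plan = [copies_to_send(kind, redundancy_required=True) for kind in FrameKind]
```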
There is therefore provided, in accordance with an embodiment of the present invention, a method for transmitting speech of a first person communicating with a second person via a packet switched network comprising: generating a stream of samples of the first person's speech during the communication; parsing the sample stream into audio frames; determining which audio frames correspond to periods when the first person is speaking and which correspond to periods when the first person is silent; transmitting audio frames corresponding to silent periods and speaking periods of the first person's speech; and transmitting at least some of the audio frames corresponding to speaking periods, but none of the audio frames corresponding to silent periods, at least twice.
Optionally, transmitting an audio frame corresponding to a speaking period at least twice comprises transmitting the audio frame at least twice only if a quality of transmission criterion indicates that the audio frame should be transmitted at least twice.
Optionally the quality of transmission criterion is a packet loss rate criterion and an audio frame is transmitted at least twice only if the packet loss rate over the network between the first and second persons exceeds a predetermined maximum.
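A packet loss rate criterion of the kind described above might be evaluated as in the following sketch. The threshold value of 5% and the way the counts of sent and lost packets are obtained (for example from receiver feedback) are assumptions and are not specified in the disclosure.

```python
def redundancy_required(packets_sent, packets_lost, max_loss_rate=0.05):
    """Return True if the observed loss rate exceeds the predetermined maximum.

    max_loss_rate is a hypothetical threshold chosen for illustration.
    """
    if packets_sent <= 0:
        return False  # no measurements yet, so no evidence that redundancy is needed
    return packets_lost / packets_sent > max_loss_rate


congested = redundancy_required(packets_sent=100, packets_lost=10)
clear = redundancy_required(packets_sent=100, packets_lost=2)
```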
Optionally, transmitting audio frames corresponding to silent periods and speaking periods comprises transmitting each of the audio frames into which the first person's speech is parsed at least once.
In some embodiments of the present invention, transmitting at least some of the audio frames corresponding to speaking periods at least twice comprises: for each audio frame corresponding to a speaking period, determining that the audio frame is a stationary audio frame if it is an audio frame, but not the first audio frame, of a sequence of at least two consecutive audio frames for which the first person's speech is stationary; and only if the audio frame is not a stationary audio frame, transmitting the audio frame at least twice.
Optionally, transmitting an audio frame corresponding to a speaking period at least twice comprises transmitting the audio frame at least twice only if a quality of transmission criterion indicates that the audio frame should be transmitted at least twice.
Optionally, the quality of transmission criterion is a packet loss rate criterion and an audio frame is transmitted at least twice only if the packet loss rate over the network between the first and second persons exceeds a predetermined maximum.
In some embodiments of the present invention, transmitting audio frames corresponding to silent periods and speaking periods comprises transmitting each of the audio frames into which the first person's speech is parsed at least once.
There is further provided, in accordance with an embodiment of the present invention, a method for transmitting speech of a first person communicating with a second person via a packet switched network comprising: generating a stream of samples of the first person's speech during the communication; parsing the sample stream into audio frames; determining which audio frames correspond to periods when the first person is speaking and which correspond to periods when the first person is silent; transmitting audio frames corresponding to silent periods and speaking periods of the first person's speech; for each audio frame corresponding to a speaking period, determining that the audio frame is a stationary audio frame if it is an audio frame, but not the first audio frame, of a sequence of at least two consecutive audio frames for which the first person's speech is stationary; and transmitting the audio frame at least twice if and only if it is not a stationary audio frame.
Optionally, transmitting audio frames corresponding to silent periods and speaking periods comprises transmitting each of the audio frames into which the first person's speech is parsed at least once.
There is further provided, in accordance with an embodiment of the present invention, a method for transmitting speech of a first person communicating with a second person via a packet switched network comprising: generating a stream of samples of the first person's speech during the communication; parsing the sample stream into audio frames; determining which audio frames correspond to periods when the first person is speaking and which correspond to periods when the first person is silent; for each audio frame corresponding to a speaking period, determining that the audio frame is a stationary audio frame if it is an audio frame, but not the first audio frame, of a sequence of at least two consecutive audio frames for which the first person's speech is stationary; and transmitting the audio frame at least once if it is not a stationary audio frame and not transmitting the audio frame if it is a stationary audio frame.
Optionally, the method comprises not transmitting audio frames of the first person's speech if the audio frames correspond to periods when the first person is silent.
Alternatively the method comprises, for each silent period optionally transmitting only a first audio frame of the silent period at least once.
There is further provided, in accordance with an embodiment of the present invention, apparatus for transmitting a person's speech over a packet switched network comprising: transmission apparatus that generates audio frames of the person's speech and transmits the audio frames over the network; a network sensor that determines whether audio frames should be transmitted more than once to meet a quality criterion of transmission; a voice monitor that determines during speech when the person is speaking and when the person is silent; and a controller that controls the transmission apparatus, wherein if the network sensor determines that an audio frame should be transmitted more than once, the controller controls the transmission apparatus to transmit the audio frame more than once only if the voice monitor indicates that the audio frame does not correspond to a time when the person is silent.
Optionally, the voice monitor determines whether an audio frame corresponding to a time at which the person is speaking corresponds to a time during which the person's speech is stationary.
Optionally, if the voice monitor determines that an audio frame corresponds to a time at which the person's speech is stationary, the controller controls the transmission apparatus to transmit the audio frame only once.