FIG. 1 of the accompanying drawings schematically illustrates a typical audio transmitter/receiver system having a transmitter 100 and a receiver 106. The transmitter 100 has an encoder 102 and a packetiser 104. The receiver 106 has a depacketiser 108 and a decoder 110. The encoder 102 encodes input audio data, which may be audio data being stored at the transmitter 100 or audio data being received at the transmitter 100 from an external source (not shown). Encoding algorithms are well known in this field of technology and shall not be described in detail in this application. An example of an encoding algorithm is the ITU-T Recommendation G.711, the entire disclosure of which is incorporated herein by reference. An encoding algorithm may be used, for example, to reduce the quantity of data to be transmitted, i.e. a data compression encoding algorithm. The encoded audio data output by the encoder 102 is packetised by the packetiser 104. Packetisation is well known in this field of technology and shall not be described in further detail. The packetised audio data is then transmitted across a communication channel 112 (such as the Internet, a local area network, a wide area network, a metropolitan area network, wirelessly, by electrical or optic cabling, etc.) to the receiver 106, at which the depacketiser 108 performs an inverse operation to that performed by the packetiser 104. The depacketiser 108 outputs encoded audio data to the decoder 110, which then decodes the encoded audio data in an inverse operation to that performed by the encoder 102.
It is known that data packets (which shall also be referred to as frames within this application) can be lost, missed, corrupted or damaged during the transmission of the packetised data from the transmitter 100 to the receiver 106 over the communication channel 112. Such packets/frames shall be referred to as lost or missed packets/frames, although it will be appreciated that this term shall include corrupted or damaged packets/frames too. Several existing packet loss concealment algorithms (also known as frame erasure concealment algorithms) are known. Such packet loss concealment algorithms generate synthetic audio data in an attempt to estimate/simulate/regenerate/synthesise the audio data contained within the lost packet(s).
One such packet loss concealment algorithm is the algorithm described in the ITU-T Recommendation G.711 Appendix 1, the entire disclosure of which is incorporated herein by reference. This packet loss concealment algorithm shall be referred to as the G.711(A1) algorithm herein. The G.711(A1) algorithm shall not be described in full detail herein as it is well known to those skilled in this area of technology. However, a portion of it shall be described below with reference to FIGS. 2 and 3 of the accompanying drawings. This portion is described in particular at sections I.2.2, I.2.3 and I.2.4 of the ITU-T Recommendation G.711 Appendix 1 document.
FIG. 2 is a flowchart showing the processing performed for the G.711(A1) algorithm when a first frame has been lost, i.e. there has been one or more received frames, but then a frame is lost. FIG. 3 is a schematic illustration of the audio data of the frames relevant for the processing performed in FIG. 2.
In FIG. 3, vertical dashed lines 300 are shown as dividing lines between a number of frames 302a-e of the audio signal. Frames 302a-d have been received whilst the frame 302e has been lost and needs to be synthesised (or regenerated). The audio data of the audio signal in the received frames 302a-d is represented by a thick line 304 in FIG. 3. In a typical application of the G.711(A1) algorithm, the audio data 304 will have been sampled at 8 kHz and will have been partitioned/packetised into 10 ms frames, i.e. each frame 302a-e is 80 audio samples long. However, it will be appreciated that other sampling frequencies and lengths of frames are possible. For example, the frames could be 5 ms or 20 ms long and could have been sampled at 16 kHz. The description below with respect to FIGS. 2 and 3 will assume a sampling rate of 8 kHz and that the frames 302a-e are 10 ms long. However, the description below applies analogously to different sampling frequencies and frame lengths.
For each of the frames 302a-e, the G.711(A1) algorithm determines whether or not that frame is a lost frame. In the scenario illustrated in FIG. 3, after the G.711(A1) algorithm has processed the frame 302d, it determines that the next frame 302e is a lost frame. In this case the G.711(A1) algorithm proceeds to regenerate (or synthesise) the missing frame 302e as described below (with reference to both FIGS. 2 and 3).
At a step S200, the pitch period of the audio data 304 that have been received (in the frames 302a-d) is estimated. The pitch period of audio data is the position of the maximum value of autocorrelation, which in the case of speech signals corresponds to the inverse of the fundamental frequency of the voice. However, this definition as the position of the maximum value of autocorrelation applies to both voice and non-voice data.
To estimate the pitch period, a normalised cross-correlation is performed of the most recent received 20 ms (160 samples) of audio data 304 (i.e. the 20 ms of audio data 304 just prior to the current lost frame 302e) at taps from 5 ms (40 samples back from the current lost frame 302e) to 15 ms (120 samples back from the current lost frame 302e). In FIG. 3, an arrow 306 depicts the most recent 20 ms of audio data 304 and an arrow 308 depicts the range of audio data 304 against which this most recent 20 ms of audio data 304 is cross-correlated. The peak of the normalised cross-correlation is determined, and this provides the pitch period estimate. In FIG. 3, a dashed line 310 indicates the length of the pitch period relative to the end of the most recently received frame 302d.
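The search described above can be sketched as follows. This is a minimal illustration rather than the reference implementation of the G.711(A1) algorithm: the function name `estimate_pitch_period`, its parameter names, and the choice to normalise by the energies of both windows are assumptions made for the example.

```python
import math


def estimate_pitch_period(history, min_lag=40, max_lag=120, window=160):
    """Return the lag (in samples) with the highest normalised cross-correlation.

    `history` holds the most recently received samples, newest last.
    All names and defaults here are hypothetical; the defaults correspond to
    5 ms, 15 ms and 20 ms at an 8 kHz sampling rate.
    """
    if len(history) < window + max_lag:
        raise ValueError("history must hold at least window + max_lag samples")

    recent = history[-window:]                      # most recent 20 ms
    recent_energy = sum(a * a for a in recent)

    best_lag, best_score = min_lag, -math.inf
    for lag in range(min_lag, max_lag + 1):
        start = len(history) - window - lag
        delayed = history[start:start + window]     # same window, `lag` samples back
        cross = sum(a * b for a, b in zip(recent, delayed))
        norm = math.sqrt(recent_energy * sum(b * b for b in delayed))
        score = cross / norm if norm > 0 else 0.0   # assumed normalisation
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Usage: a tone that repeats every 70 samples, four 80-sample frames long.
signal = [math.sin(2 * math.pi * n / 70) for n in range(320)]
pitch = estimate_pitch_period(signal)
```

With a history of four 80-sample frames, the 160-sample window plus the largest lag of 120 samples fits comfortably within the available data.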
In some embodiments, this estimation of the pitch period is performed as a two-stage process. The first stage involves a coarse search for the pitch period, in which the relevant part of the most recent audio data undergoes a 2:1 decimation prior to the normalised cross-correlation, which results in an approximate value for the pitch period. The second stage involves a finer search for the pitch period, in which the normalised cross-correlation is performed (on the non-decimated audio data) in the region around the pitch period estimated by the coarse search. This reduces the amount of processing involved and increases the speed of finding the pitch period.
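A two-stage search of this kind might look like the sketch below. The helper names (`ncc`, `best_lag`, `two_stage_pitch`), the refinement radius of 2 samples, and the decimation by simply discarding every second sample (with no anti-aliasing filter) are all assumptions made for illustration only.

```python
import math


def ncc(signal, window, lag):
    """Normalised cross-correlation of the last `window` samples at `lag`."""
    recent = signal[-window:]
    start = len(signal) - window - lag
    delayed = signal[start:start + window]
    cross = sum(a * b for a, b in zip(recent, delayed))
    norm = math.sqrt(sum(a * a for a in recent) * sum(b * b for b in delayed))
    return cross / norm if norm > 0 else 0.0


def best_lag(signal, window, lags):
    """Return the lag from `lags` with the highest correlation score."""
    return max(lags, key=lambda lag: ncc(signal, window, lag))


def two_stage_pitch(history, min_lag=40, max_lag=120, window=160, radius=2):
    # Stage 1: coarse search on 2:1 decimated data (hypothetical: plain
    # sample dropping). Lags and window are halved to match.
    decimated = history[::2]
    coarse = 2 * best_lag(decimated, window // 2,
                          range(min_lag // 2, max_lag // 2 + 1))
    # Stage 2: fine search on the full-rate data around the coarse estimate.
    low = max(min_lag, coarse - radius)
    high = min(max_lag, coarse + radius)
    return best_lag(history, window, range(low, high + 1))


# Usage: an odd period (71 samples) cannot be found by the coarse stage alone.
signal = [math.sin(2 * math.pi * n / 71) for n in range(320)]
pitch = two_stage_pitch(signal)
```

In this sketch the coarse stage evaluates 41 lags on half-length windows and the fine stage at most 5 lags on full-length windows, instead of 81 lags on full-length windows.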
In other embodiments, the estimate of the pitch period is performed only using the above-mentioned coarse estimation.
It will be appreciated that other methods of estimating the pitch period, as are well known in this field of technology, can be used. For example, an average-magnitude-difference function could be used. The average-magnitude-difference function involves computing the sum of the magnitudes of the differences between the samples of a signal and the samples of a delayed version of that signal. The pitch period is then identified as the delay at which this sum of differences is at a minimum.
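By way of illustration, an average-magnitude-difference search could be sketched as below. The function name `amdf_pitch` is hypothetical, and reusing the 40 to 120 sample lag range and 160-sample window from the cross-correlation description is an assumption of this example.

```python
def amdf_pitch(history, min_lag=40, max_lag=120, window=160):
    """Return the lag that minimises the sum of magnitude differences.

    `history` holds the most recently received samples, newest last.
    Names and default values are assumptions made for illustration.
    """
    if len(history) < window + max_lag:
        raise ValueError("history must hold at least window + max_lag samples")

    recent = history[-window:]

    def magnitude_difference(lag):
        # Compare the recent window with a copy delayed by `lag` samples.
        start = len(history) - window - lag
        delayed = history[start:start + window]
        return sum(abs(a - b) for a, b in zip(recent, delayed))

    return min(range(min_lag, max_lag + 1), key=magnitude_difference)


# Usage: a sawtooth that repeats exactly every 70 samples.
sawtooth = [n % 70 for n in range(320)]
pitch = amdf_pitch(sawtooth)
```

Unlike the normalised cross-correlation, this measure needs no multiplications or square roots, which is one reason it is sometimes preferred on constrained hardware.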
In order to avoid aliasing or other unwanted audio effects at the cross-over between the most recently received frame 302d and the regenerated frame 302e, at a step S202 an overlap-add (OLA) procedure is carried out. The audio data 304 of the most recently received frame 302d is modified by performing an OLA operation on its most recent ¼ pitch period. It will be appreciated that there are a variety of methods for, and options available for, performing this OLA operation. In one embodiment of the G.711(A1) algorithm, the most recent ¼ pitch period is multiplied by a downward sloping ramp, ranging from 1 to 0, (a ramp 312 in FIG. 3) and has added to it the most recent ¼ pitch period multiplied by an upward sloping ramp, ranging from 0 to 1 (a ramp 314 in FIG. 3). Whilst this embodiment makes use of triangular windows, other windows (such as Hanning windows) could be used instead.
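The ramp-based OLA operation can be sketched as follows. The sketch assumes that the segment weighted by the upward sloping ramp is the ¼ pitch period located one full pitch period earlier in the received audio data, which is our reading of the G.711(A1) algorithm and is not spelled out in the description above. The function name `overlap_add_tail` and the exact ramp end points are likewise assumptions.

```python
def overlap_add_tail(history, pitch):
    """Return a copy of `history` whose last quarter pitch period is cross-faded.

    Assumption: the fade-in segment is taken one pitch period earlier.
    Triangular (linear) ramps are used; the weights always sum to 1.
    """
    quarter = pitch // 4
    if quarter == 0 or len(history) < pitch + quarter:
        raise ValueError("not enough history for the requested pitch period")

    out = list(history)
    n = len(out)
    for i in range(quarter):
        up = (i + 1) / quarter        # upward ramp, ending at exactly 1
        down = 1.0 - up               # downward ramp, ending at exactly 0
        cur = n - quarter + i         # index within the last quarter period
        out[cur] = down * history[cur] + up * history[cur - pitch]
    return out


# Usage: a step signal, so the effect of the cross-fade is easy to see.
step = [0.0] * 40 + [1.0] * 40
faded = overlap_add_tail(step, pitch=40)
```

For a signal that is exactly periodic at the estimated pitch period, the two segments are identical and the operation leaves the audio data unchanged; otherwise the tail is blended smoothly towards the earlier pitch cycle.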
The modified most recently received frame 302d is output instead of the originally received frame 302d. Hence, the output of this frame 302d preceding the current (lost) frame 302e must be delayed by a ¼ pitch period duration, so that the last ¼ pitch period of this most recently received frame 302d can be modified in the event that the following frame (frame 302e in FIG. 3) is lost. As the longest pitch period searched for is 120 samples, the output of the preceding frame 302d must be delayed by ¼×120 samples=30 samples (or 3.75 ms for 8 kHz sampled data). In other words, each frame 302 that is received must be delayed by 3.75 ms before it is output (to storage, for transmission, or to an audio port, for example).
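The delay figure quoted above follows from simple arithmetic, reproduced here as a small sketch. The function name `output_delay` is hypothetical; the fraction of ¼ and the values of 120 samples and 8 kHz are taken from the description above.

```python
def output_delay(max_pitch_samples, sample_rate_hz):
    """Return the output delay as (samples, milliseconds).

    The delay is one quarter of the longest pitch period searched for.
    """
    delay_samples = max_pitch_samples // 4
    delay_ms = 1000 * delay_samples / sample_rate_hz
    return delay_samples, delay_ms


# Usage: longest pitch period of 120 samples at an 8 kHz sampling rate.
samples, milliseconds = output_delay(120, 8000)
```

The same function shows how the delay scales for other configurations, for example a longer maximum pitch period or a higher sampling rate.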
To regenerate the lost frame 302e, at a step S204, the audio data 304 of the most recent pitch period is repeated as often as is necessary to fill the 10 ms of the lost frame 302e. The number of repetitions of the pitch period depends on the length of the frame 302e and the length of the pitch period. For example, if the pitch period is 50 samples long, then the audio data 304 within the most recently received pitch period is repeated 80/50=1.6 times to regenerate the lost frame 302e. The number of repetitions of the pitch period is the number required to span the length of the lost frame 302e. 
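The repetition step can be sketched as below. The function name `repeat_pitch` is an assumption of this example, and the sketch covers only the simple repetition of a single pitch period described above, not any further refinements of the G.711(A1) algorithm.

```python
def repeat_pitch(history, pitch, frame_len=80):
    """Fill a lost frame by cycling through the most recent pitch period.

    `history` holds the most recently received samples, newest last.
    """
    if pitch <= 0 or len(history) < pitch:
        raise ValueError("history must hold at least one pitch period")

    last_period = history[-pitch:]
    # Wrap around the stored pitch period until the frame is full.
    return [last_period[i % pitch] for i in range(frame_len)]


# Usage: a 50-sample pitch period fills an 80-sample frame 1.6 times.
received = list(range(100))
frame = repeat_pitch(received, pitch=50)
```

In this usage example the regenerated frame consists of one complete copy of the last 50 samples followed by the first 30 samples of a second copy.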
Other proposed packet loss concealment algorithms involve regenerating a lost frame by using not only audio data from frames that have been received prior to the lost frame but also audio data from frames that have been received after the lost frame. Thus, these packet loss concealment algorithms also inherently impose a delay on the output of frames, as a regenerated frame cannot be output until a frame is received after the loss of frames.
Increasingly, there is a drive to decrease, or minimize, the delays introduced into audio processing paths. As more and more processing is applied to audio data, even small delays resulting from each processing step can compound to an unacceptably large delay of the audio data.
It is therefore an object of the present invention to provide a packet loss concealment algorithm that reduces, or minimizes, the delay introduced into the audio data.