On an error-prone network, every codec tries to mitigate the artifacts caused by packet losses. The state of the art focuses on concealing the lost information by various methods, ranging from simple muting or noise substitution to advanced methods such as prediction based on past good frames. One major source of packet-loss artifacts that has been largely overlooked lies in the recovery, i.e., in the first few good frames after a loss.
Due to the long-term prediction often used in speech codecs, the recovery artifact can be severe, and the error propagation can affect multiple subsequent good frames. Some conventional technology tries to mitigate that problem; see, e.g., [1] and [2].
In the case of generic or audio codecs (any codec working in the transform domain), a lot of documentation on the concealment of frame losses can be found, e.g., in [3]. However, the available conventional technology does not focus on the recovery frames. It is assumed that, owing to the nature of transform-domain codecs, the overlap-and-add will smooth out the transition artifacts. One good example is AAC-ELD (AAC-ELD = Advanced Audio Coding Enhanced Low Delay; see [4]), used in FaceTime for communication over IP networks.
The first few frames after a frame loss are referred to as “recovery frames”. Conventional transform-domain codecs do not appear to provide any special handling of the one or more recovery frames, and annoying artifacts sometimes occur. One example of a problem that can arise during recovery is the superposition of the concealed signal and the good signal in the overlap-and-add part, which sometimes leads to annoying energy boosts.
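The energy boost described above can be illustrated with a toy model. The following is a minimal sketch, not the behavior of any particular codec: it assumes that the tail of the concealed signal is faded out with a cosine window and the head of the good signal is faded in with a sine window (a power-complementary pair), and that both signals are sinusoids of equal amplitude. The function name `overlap_add_energy` and all parameters are hypothetical.

```python
import math

def overlap_add_energy(phase_offset, n=256, period=32.0):
    """Energy of the overlap region relative to the good signal alone.

    Assumption: the concealed tail and the good head are sinusoids of the
    same frequency that differ only by `phase_offset` (radians).
    """
    ref_energy = 0.0
    ola_energy = 0.0
    for i in range(n):
        theta = 0.5 * math.pi * (i + 0.5) / n
        w_out = math.cos(theta)   # fade-out applied to the concealed tail
        w_in = math.sin(theta)    # fade-in applied to the good head
        good = math.sin(2.0 * math.pi * i / period)
        concealed = math.sin(2.0 * math.pi * i / period + phase_offset)
        mixed = w_out * concealed + w_in * good
        ref_energy += good * good
        ola_energy += mixed * mixed
    return ola_energy / ref_energy

if __name__ == "__main__":
    for offset in (0.0, math.pi / 2.0, math.pi):
        print(offset, overlap_add_energy(offset))
```

Under these assumptions, an in-phase superposition yields a ratio well above 1 (an energy boost), while an out-of-phase superposition yields a ratio well below 1 (partial cancellation). The window shapes and the purely sinusoidal signals are simplifications chosen only to make the effect visible.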
Another problem is abrupt pitch changes at frame borders. For speech signals, for example, when the pitch of the original signal changes and a frame loss occurs, the concealment method might predict the pitch at the end of the frame slightly incorrectly. This slightly incorrect prediction might cause a jump of the pitch at the transition into the next good frame. Most of the known concealment methods do not even use prediction and only use a fixed pitch based on the last valid pitch, which can result in an even bigger mismatch with the first good frame. Some other methods use advanced prediction to reduce the drift; see, for example, TD-TCX PLC (TD = Time Domain; TCX = Transform Coded Excitation; PLC = Packet Loss Concealment) in EVS (EVS = Enhanced Voice Services), see [5].
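The mismatch caused by holding the last valid pitch can be quantified with a small sketch. It assumes, purely for illustration, that the fundamental frequency of the original signal moves linearly across the lost frame, that the concealment keeps the last valid frequency, and that a frame is 320 samples at 16 kHz (20 ms). The function names and default values are hypothetical and are not taken from any of the cited codecs.

```python
import math

def phase_at_frame_end(f_start, f_end, frame_len, sample_rate):
    """Accumulated phase (radians) of a tone whose frequency moves
    linearly from f_start to f_end (Hz) over one frame."""
    phase = 0.0
    for i in range(frame_len):
        f = f_start + (f_end - f_start) * i / frame_len
        phase += 2.0 * math.pi * f / sample_rate
    return phase

def border_phase_jump(f_last, f_true_end, frame_len=320, sample_rate=16000):
    """Phase error (radians, wrapped to (-pi, pi]) at the frame border when
    the concealment holds f_last while the original drifts to f_true_end."""
    true_phase = phase_at_frame_end(f_last, f_true_end, frame_len, sample_rate)
    held_phase = phase_at_frame_end(f_last, f_last, frame_len, sample_rate)
    diff = (true_phase - held_phase) % (2.0 * math.pi)
    if diff > math.pi:
        diff -= 2.0 * math.pi
    return diff

if __name__ == "__main__":
    # Original pitch rises from 200 Hz to 210 Hz during the lost frame.
    print(border_phase_jump(200.0, 210.0))
```

Even a 10 Hz drift over a single 20 ms frame leaves a phase error of a noticeable fraction of a cycle at the border in this model, and the error grows with the number of consecutive lost frames. A concealment that predicts the pitch evolution would reduce `f_true_end - f_last` as seen by this calculation, but any residual prediction error still accumulates in the same way.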
State-of-the-art methods for modifying the pitch of a speech signal, such as TD-PSOLA (TD-PSOLA = Time Domain Pitch Synchronous Overlap-Add), see [6] and [7], perform prosody modifications on the speech signal, such as duration expansion or contraction (known as time-stretching) or a change of the fundamental frequency (the pitch). This is done by decomposing the speech signal into short-term, pitch-synchronous analysis signals that are then repositioned on the time axis and progressively overlap-added. However, when the pitch in the concealed frame and the pitch in the original signal differ, the signal in the recovery frame is already corrupted after the overlapping mechanism. The TD-PSOLA mechanism would merely reposition the artifact on the time axis, which is not suitable for recovery.
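The decompose, reposition, and overlap-add principle described above can be sketched as follows. This is a heavily simplified illustration and not the method of [6] or [7]: it assumes a constant, known input pitch period (in samples), places the analysis pitch marks on a regular grid instead of estimating them, and uses Hann-windowed segments two periods long. The names `hann` and `psola_shift` are hypothetical.

```python
import math

def hann(length):
    """Hann window sampled at half-sample offsets, so no tap is exactly zero."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 0.5) / length)
            for i in range(length)]

def psola_shift(signal, period, pitch_factor):
    """Toy TD-PSOLA-style pitch change.

    Assumption: the input has a constant pitch period of `period` samples,
    and analysis pitch marks lie at multiples of `period`.
    """
    new_period = max(1, int(round(period / pitch_factor)))
    win = hann(2 * period)
    n = len(signal)
    out = [0.0] * n
    norm = [0.0] * n
    marks = list(range(period, n - period, period))  # analysis pitch marks
    if not marks:
        return list(signal)
    t_out = period  # first synthesis pitch mark
    while t_out < n - period:
        # Pick the analysis segment whose mark is closest in time.
        mark = min(marks, key=lambda m: abs(m - t_out))
        # Reposition the windowed segment at the synthesis mark.
        for k in range(-period, period):
            out[t_out + k] += win[k + period] * signal[mark + k]
            norm[t_out + k] += win[k + period]
        t_out += new_period
    # Normalize by the summed window weights where segments were placed.
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]

if __name__ == "__main__":
    period = 40
    x = [math.sin(2.0 * math.pi * i / period) for i in range(400)]
    y = psola_shift(x, period, 2.0)  # synthesis marks every 20 samples
    print(y[100:105])
```

Because each output segment is a windowed copy of an input segment, anything contained in that segment, including a distortion created by an earlier overlap of mismatched signals, is copied along with it. The sketch changes where segments are placed, not what they contain, which is consistent with the observation that such a mechanism only repositions an existing artifact.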