1. Field of Invention
This invention relates to a method and apparatus for hiding data packet pre-buffering delays in multimedia streaming applications.
2. Description of Related Art
Currently, multimedia applications that transmit blocks of data from a source to a destination over networks without quality of service (QoS) guarantees, e.g., datagram networks, have to build up a First-In, First-Out (FIFO) buffer of incoming packets to cope with problems such as delay jitter and disordered packets. These problems occur in the network layer; therefore, streaming applications are unable to eliminate them. Conventional multimedia streaming applications try to hide these problems by pre-buffering data packets for several seconds before playing them out. However, this pre-buffering introduces a delay between selection and perception of a channel. For example, when a subscriber uses multimedia applications in datagram networks to play music, the subscriber may have to wait several seconds after a channel is selected before hearing any music. If existing implementations were to initiate play out immediately, the conventional multimedia streaming applications would generally not have any packets to play out. The user would, in the case of audio rendering, hear distortions such as pops and clicks or interspersed silence in the audio output.
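The conventional pre-buffering behavior described above can be illustrated with a minimal sketch. The class name `PreBuffer`, its `threshold` parameter, and the use of integer sequence numbers are assumptions made for illustration only; they are not taken from any particular streaming application.

```python
import heapq


class PreBuffer:
    """Hypothetical conventional pre-buffer: hold packets until a threshold is reached."""

    def __init__(self, threshold):
        self.threshold = threshold  # packets to accumulate before play out starts
        self._heap = []             # min-heap keyed on sequence number
        self._started = False

    def push(self, seq, payload):
        # Packets may arrive out of order; the heap restores sequence order.
        heapq.heappush(self._heap, (seq, payload))
        if len(self._heap) >= self.threshold:
            self._started = True

    def pop(self):
        # Nothing is played while pre-buffering, which is the delay the user perceives.
        if not self._started or not self._heap:
            return None
        return heapq.heappop(self._heap)
```

With a threshold of three packets, `pop()` returns `None` until the third packet has arrived, after which packets are returned in sequence order regardless of arrival order.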
This invention provides a special transient mode for rendering multimedia data in the first few seconds of play out, while minimizing both the distortion of the output and the delay between selection and play out caused by pre-buffering of data packets in multimedia streaming applications. Instead of pre-buffering all incoming data packets until a certain threshold is reached, the streaming application starts playing out some of the multimedia stream immediately after the arrival of the first data packet. Immediate play out of the first data packet, for example, results in minimum delay between channel selection and perception, thereby allowing a user to scan through all available channels and quickly get a notion of the content. For example, a subscriber who selects a music channel while using multimedia applications in datagram networks can hear the selected channel almost immediately.
This immediate play out of data packets is done at a reduced speed with less than all incoming data packets. For example, if ten data packets are to be received, the first data packet can be played out immediately upon receipt. The remaining nine data packets can be pre-buffered in the background of this immediate play out. The reduced speed play out, e.g., slow mode, can continue until the buffer reaches a predetermined limit in the background. Instead of playing out every actual data packet in sequence after the initial data packet play out, fill packets can be inserted between the actual data packets.
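The slow mode described above can be sketched as a small simulation. It assumes, purely for illustration, that one packet arrives per play-out tick, that a fill packet is a plain repetition of the preceding real packet, and that slow mode ends once the backlog reaches `buffer_limit`; the function name and parameters are hypothetical.

```python
from collections import deque


def transient_playout(num_packets, buffer_limit=3):
    """Return the play-out schedule as a list of ("real" | "fill", sequence number)."""
    buffer = deque()
    schedule = []
    slow_mode = True
    pending_fill = None  # sequence number to repeat on the next tick, if any
    arrived = 0
    while arrived < num_packets or buffer:
        # Assumption: exactly one packet arrives per tick while the stream lasts.
        if arrived < num_packets:
            buffer.append(arrived)
            arrived += 1
        # Leave slow mode once enough packets have accumulated in the background.
        if slow_mode and len(buffer) >= buffer_limit:
            slow_mode = False
            pending_fill = None
        if pending_fill is not None:
            schedule.append(("fill", pending_fill))
            pending_fill = None
        else:
            seq = buffer.popleft()
            schedule.append(("real", seq))
            if slow_mode:
                pending_fill = seq  # stretch play out by repeating this packet
    return schedule
```

In this sketch the first packet is played on the tick it arrives, and each fill packet lets the backlog grow by one packet until the limit is reached.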
The fill packets are packets synthesized from the earlier packets received from the channel or station and are used to stretch the initial few seconds of playback time in a pitch-preserving, or nearly pitch-preserving, fashion. For example, the first three seconds of received signals can be augmented by six seconds of synthesized signals, which together result in a rendering that spans nine seconds of play out instead of the original three seconds.
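The arithmetic of the example can be written out as a short sketch. It assumes the source delivers media in real time (one second of signal per second of wall-clock time); the function name `transient_budget` is hypothetical.

```python
def transient_budget(real_seconds, fill_seconds):
    """Return (play-out seconds, stretch factor, seconds buffered at the end)."""
    if real_seconds <= 0:
        raise ValueError("real_seconds must be positive")
    playout_seconds = real_seconds + fill_seconds
    stretch_factor = playout_seconds / real_seconds
    # Assumption: media keeps arriving in real time during the whole play out,
    # while only real_seconds of it has been consumed.
    buffered_seconds = playout_seconds - real_seconds
    return playout_seconds, stretch_factor, buffered_seconds
```

For three seconds of received signal and six seconds of synthesized signal, `transient_budget(3, 6)` gives a play out of nine seconds, a stretch factor of three, and six seconds of signal accumulated in the buffer under the stated assumption.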
Since data packets continue to arrive during the rendering of the augmented signals, e.g., during the excess six seconds in the example above, the rendering engine accumulates a buffer of packets which can allow the system to handle delay jitters and disordering of data packets. That is, after an initial interval of a few seconds in which the augmentation occurs, the number of data packets synthesized decreases as the buffer fills. Eventually, when the buffer is filled, synthesis ceases and the rendering proceeds as normal.
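One way to picture the decreasing amount of synthesis is a taper that depends on buffer occupancy. The linear taper, the `max_fills` ceiling, and the function name below are assumptions chosen for illustration; the text does not prescribe a particular schedule.

```python
def fills_per_real_packet(buffered, target, max_fills=2):
    """Number of fill packets to insert after each real packet (assumed linear taper)."""
    if buffered >= target:
        return 0  # buffer is full: synthesis ceases, normal rendering
    remaining = 1.0 - buffered / target
    # Keep at least one fill packet until the target is reached so the buffer keeps growing.
    return max(1, round(max_fills * remaining))
```

With a target of six buffered packets, this sketch inserts two fill packets per real packet while the buffer is nearly empty, one as it fills, and none once the target is reached.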
Audio and video signals generally contain considerable redundancy. The removal of such redundancy is the focus of modern source coding, i.e., signal compression, techniques. In many cases, there is redundancy not only within the frames encapsulated by a single packet, but also between frames encapsulated by two or more packets. The redundancy implies that in such cases a given packet may be predicted more or less from its neighboring packets.
This predictability may be calculated either in an objective classical signal-to-noise ratio (SNR) sense, or may be determined in a quasi-subjective way, via a perceptual model, e.g., as perceptual entropy, or in other ways previously developed or yet to be developed.
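As an illustration of the objective SNR measure, the sketch below scores how well a packet is predicted by some estimate, for example a copy of its neighboring packet. Treating packets as equal-length lists of samples and the function name `prediction_snr_db` are assumptions; a perceptual model would replace this calculation in the quasi-subjective case.

```python
import math


def prediction_snr_db(packet, predicted):
    """SNR in dB of a packet relative to its prediction error (higher = more predictable)."""
    if len(packet) != len(predicted):
        raise ValueError("packet and prediction must have the same length")
    signal_energy = sum(x * x for x in packet)
    error_energy = sum((x - p) ** 2 for x, p in zip(packet, predicted))
    if error_energy == 0:
        return math.inf   # perfect prediction
    if signal_energy == 0:
        return -math.inf  # silent packet predicted by a non-silent estimate
    return 10 * math.log10(signal_energy / error_energy)
```

A prediction that is off by ten percent on every sample yields roughly 20 dB, while predicting silence for a non-silent packet yields 0 dB.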
In order to reproduce a signal that is as close to an original signal as possible, the decision on which actual data packets to repeat as fill packets and how often is based on the signal's perceptual entropy. The higher the perceptual entropy of an actual data packet, the less likely that packet is to be repeated as a fill packet. In order for the synthesized packets used to augment the initial rendered packets to introduce minimal distortion into the rendered, e.g., audio, signal, fill packets are synthesized from the subset of initial packets in which the predictability is known to be high, either from side information in the stream or by inference from data in the packet.
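The selection step can be sketched as follows, assuming each initial packet already carries a numeric predictability score (higher meaning more predictable, whether obtained from side information or inferred from the packet). The threshold, the scores, and the function name are hypothetical.

```python
def choose_fill_sources(predictability, threshold, fills_needed):
    """Return indices of the packets to use as fill sources, in play-out order."""
    # Only packets whose predictability is known to be high are candidates.
    candidates = [i for i, p in enumerate(predictability) if p >= threshold]
    # Prefer the most predictable packets when fewer fills are needed than candidates.
    ranked = sorted(candidates, key=lambda i: predictability[i], reverse=True)
    return sorted(ranked[:fills_needed])
```

For scores `[0.9, 0.2, 0.8, 0.95, 0.4]` with a threshold of 0.5, requesting two fill packets selects packets 0 and 3, and packets 1 and 4 are never repeated.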
Time stretching usually causes some loss in signal quality, but the insertion of fill packets in the special rendering mode offers a signal quality that is good enough for the user to readily get an idea of the content of the selected channel without experiencing a long delay, while at the same time building a buffer of accumulated packets that allows the rendering system to improve the quality to a level provided by standard stream buffering techniques. After a few seconds of the special rendering mode, during which the application has pre-buffered actual data packets in the background, the system can seamlessly switch from the reduced-speed mode to the real play out mode, for example without user involvement.
These and other aspects of the invention will be apparent or obvious from the following description.