The Internet is becoming a primary method for distributing media content (e.g., video and audio, or audio alone) and other information to end users. It is currently possible to download music, video, games, and other media information to computers, cell phones, and virtually any network-capable device. The percentage of people accessing the Internet for media content is growing rapidly. The quality of the viewer experience is a key barrier to the growth of online video viewing. Consumer expectations for online video are set by their television and movie viewing experiences.
Audience numbers for streaming video on the web are rapidly growing, and there is growing interest in, and demand for, viewing video on the Internet. Streaming of data files, or “streaming media,” refers to technology that delivers sequential media content at a rate sufficient to present the media to a user at the originally anticipated playback speed without significant interruption. Unlike downloaded data of a media file, streamed data may be stored in memory until the data is played back and then deleted after a specified amount of time has passed.
Streaming media content over the Internet presents some challenges compared to regular broadcasts over the air, satellite, or cable. One concern that arises in the context of encoding the audio of the media content is the introduction of boundary artifacts when segmenting the video and audio into fixed-time portions. In one conventional approach, the audio is segmented into portions having a fixed-time duration that matches the fixed-time duration of the corresponding video, for example, two seconds. In this approach, the audio boundaries always align with the video boundaries. The conventional approach starts a new encode session of an audio codec to encode each audio portion for each content file, for example, using Low Complexity Advanced Audio Coding (AAC LC). Because a new encode session is used for each portion of audio, the audio codec interprets the beginning and end of the waveform as transitions from zero, resulting in a pop or click noise in the playback of the encoded portion at the portion boundaries, such as illustrated in FIG. 1. The pop or click noises are referred to as boundary artifacts. Also, the audio codec encodes the audio of the fixed-time duration according to a codec-enforced frame size. This also introduces boundary artifacts when the number of samples in the audio portion is not evenly divisible by the codec-enforced frame size.
FIG. 1 is a diagram illustrating an exemplary audio waveform 100 for two portions of audio using a conventional approach. The audio waveform 100 illustrates the transition from zero 102 between the first and second portions of video. When the audio codec has a fixed-frame size (referred to herein as a codec-enforced frame size), the audio codec requires that the last frame 104 be padded with zeros when the number of samples of the portion is not evenly divisible by the number of samples per frame according to the codec-enforced frame size. For example, when using a sampling rate of 48 kHz, there are 96,000 samples generated for an audio segment of two seconds. When dividing the number of samples, 96,000, by the number of samples per frame (e.g., 1024 samples for AAC LC and 2048 samples for High Efficiency AAC (HE AAC)), the result for the 1024-sample frame size is 93.75 frames. Since the number 93.75 is not an integer, the audio codec pads the last frame 104 with zeros. In this example, the last 256 samples of the last frame are given a zero value. Although the zero values represent silent audio, the padding of the last frame with zeros results in a pop or click noise during playback of the encoded portion of audio at the portion boundaries. The transitions from zero 102 and the padded zeros in the last frame 104 introduce boundary artifacts. The introduction of boundary artifacts can decrease the overall quality of the audio, affecting the user's experience during playback of the media content.
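The padding arithmetic in the example above can be checked with a short sketch. The function name `padding_for_portion` and its parameters are hypothetical and chosen only for illustration; the sketch assumes the codec simply rounds the portion up to a whole number of frames and fills the remainder with zeros, as the conventional approach is described here. It does not model any real encoder's API.

```python
def padding_for_portion(sample_rate_hz, duration_s, frame_size):
    """Return (samples, exact_frames, encoded_frames, padded_zeros) for one
    fixed-time audio portion encoded with a codec-enforced frame size.

    Hypothetical helper: assumes the last frame is zero-padded up to a
    full frame, as in the conventional approach described in the text.
    """
    samples = sample_rate_hz * duration_s       # e.g., 48,000 * 2 = 96,000
    exact_frames = samples / frame_size         # may be fractional
    encoded_frames = -(-samples // frame_size)  # integer ceiling division
    padded_zeros = encoded_frames * frame_size - samples
    return samples, exact_frames, encoded_frames, padded_zeros


# Two-second portion at 48 kHz with 1024-sample frames (AAC LC).
print(padding_for_portion(48_000, 2, 1024))  # (96000, 93.75, 94, 256)

# Same portion with 2048-sample frames (HE AAC).
print(padding_for_portion(48_000, 2, 2048))  # (96000, 46.875, 47, 256)
```

With either frame size, the two-second portion ends partway through a frame, so the last frame carries 256 zero-valued samples.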
Another conventional approach attempts to limit the number of boundary artifacts by using portions of audio having a longer duration in order to align with frame boundaries. However, when a longer-duration portion is used for the audio, the audio and video may need to be packaged separately. This may present a drawback for streaming media content having audio and video, especially when the same media content is encoded at different quality levels, for example, as used in the context of adaptive streaming, which allows shifting between the different quality levels during playback of the media content.
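To illustrate why frame-aligned audio portions tend to be much longer than the video portions, the following sketch computes the shortest audio duration that is both a whole number of video portions and a whole number of codec frames. The text does not state how the longer duration is chosen in practice; using the least common multiple is an assumption made here for illustration, and the function name `shortest_aligned_duration` is hypothetical.

```python
import math


def shortest_aligned_duration(sample_rate_hz, video_portion_s, frame_size):
    """Return (aligned_samples, video_portions_spanned, aligned_seconds).

    Hypothetical helper: finds the smallest sample count that is a multiple
    of both the video portion length (in samples) and the codec-enforced
    frame size, i.e., their least common multiple.
    """
    portion_samples = sample_rate_hz * video_portion_s
    aligned_samples = (portion_samples * frame_size
                       // math.gcd(portion_samples, frame_size))
    return (aligned_samples,
            aligned_samples // portion_samples,
            aligned_samples // sample_rate_hz)


# 48 kHz audio, two-second video portions, 1024-sample frames (AAC LC).
print(shortest_aligned_duration(48_000, 2, 1024))  # (384000, 4, 8)

# Same settings with 2048-sample frames (HE AAC).
print(shortest_aligned_duration(48_000, 2, 2048))  # (768000, 8, 16)
```

Under these assumptions, an audio portion would have to span four two-second video portions (eight seconds) for AAC LC, or eight video portions (sixteen seconds) for HE AAC, before its end falls on a frame boundary. Such audio portions would no longer correspond one-to-one with the video portions.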