Perceptual coding is an emerging coding technology. Background-modeling-based coding is a representative perceptual coding technology and is mainly applied to scenarios with nearly unchanged backgrounds, such as video surveillance, conference calls, and newscasts. In addition to using background modeling to effectively reduce scene redundancy in the background, the background-modeling-based coding technology uses a new group of pictures (GOP) structure, and can therefore significantly increase the compression rate compared with a general coding technology.
A GOP is the basic structure of a coded video stream and is also the smallest unit that can be independently decoded in a video sequence. A conventional GOP includes one independently decoded frame and multiple non-independently decoded frames, and the quantity of video frames included in a GOP is called the length of the GOP. To improve the compression rate, the length of a GOP can be increased appropriately, but an excessively long GOP aggravates error diffusion. The background-modeling-based coding technology improves the GOP structure. In the improved GOP structure, the independently decoded frame is a background frame (G frame), and the non-independently decoded frames include a background update frame (S frame), a forward-predicted frame (P frame), and a bi-directional predicted frame (B frame). An S frame can reference only the G frame closest to it. A P frame can reference a G frame that precedes it or an S frame that is close to it, and the same applies to a B frame. Because this improved GOP structure suppresses error diffusion, the background-modeling-based coding technology can use a very long GOP to further improve coding efficiency.
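The reference rules of the improved GOP structure can be illustrated with a minimal sketch. The function name `reference_candidates` and the list-of-letters representation of a GOP are hypothetical and are used only for illustration. The sketch assumes that the GOP starts with its G frame and interprets "an S frame that is close to" a P or B frame as the nearest preceding S frame, which is an assumption. Only references to background frames are modeled.

```python
from typing import List


def reference_candidates(gop: List[str], i: int) -> List[int]:
    """Return indices of the G/S frames that frame i may reference.

    gop is a list of frame types ("G", "S", "P", "B") in order and is
    assumed to start with the G frame of the GOP.
    """
    kind = gop[i]
    if kind == "G":
        # The G frame is independently decoded and references nothing.
        return []

    # Nearest G frame before frame i.
    g_index = max(j for j in range(i) if gop[j] == "G")
    if kind == "S":
        # An S frame references only the closest G frame.
        return [g_index]

    # P and B frames: the preceding G frame, or the nearest preceding
    # S frame (assumed meaning of "close to").
    s_indices = [j for j in range(i) if gop[j] == "S"]
    return [g_index] + ([s_indices[-1]] if s_indices else [])


if __name__ == "__main__":
    gop = ["G", "P", "B", "S", "P", "B"]
    for index, frame_type in enumerate(gop):
        print(index, frame_type, reference_candidates(gop, index))
```

In this sketch every frame resolves to the G frame or to one S frame, which reflects why an error in an ordinary frame does not have to spread through the rest of a long GOP.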
Dynamic Adaptive Streaming over Hypertext Transfer Protocol (HTTP) (DASH) technology transfers media content to a user over HTTP and has become a development trend in the network video industry. A key to the technology is sequentially dividing media content on a server into media segments of equal duration, generally two to ten seconds each. Every media segment corresponds to one HTTP network address such that a client can acquire the media segment from that address. The server provides a Media Presentation Description (MPD) file, which records how these media segments are acquired over HTTP and information about the playback period of the media segments. A media segment may be further divided into subsegments, and every subsegment includes several video frames. In DASH, a segment index (sidx) is defined to indicate the start location of each subsegment within its media segment. The sidx further includes the playback duration of each subsegment and location information of the first Stream Access Point (SAP) in each subsegment.
In an existing DASH access procedure, a terminal first acquires an MPD file and determines, according to the MPD file and an access time point input by a user, the segment corresponding to the access time point. The terminal then determines, according to the sidx of the segment, the subsegment corresponding to the access time point and the location of the first SAP of that subsegment, starts decoding from the video frame corresponding to the first SAP, and then performs playback.
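The access procedure described above can be sketched as follows. This is a simplified model and not an implementation of a DASH client: the classes `Subsegment` and `Segment` and the function `locate_access` are hypothetical, the MPD is reduced to a single equal segment duration, and the sidx is reduced to a list of subsegment durations and first-SAP times expressed in seconds rather than in the units and offsets of an actual sidx.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Subsegment:
    duration: float        # playback duration, in seconds
    first_sap_time: float  # time of the first SAP, in seconds from media start


@dataclass
class Segment:
    subsegments: List[Subsegment]  # stands in for the parsed sidx


def locate_access(segments: List[Segment], segment_duration: float,
                  t: float) -> Tuple[int, int, float]:
    """Return (segment index, subsegment index, SAP time) for access time t."""
    # Step 1: equal-length segments, so the segment follows from t directly.
    seg_idx = int(t // segment_duration)
    if not 0 <= seg_idx < len(segments):
        raise ValueError("access time point is outside the media content")

    # Step 2: walk the subsegment durations to find the one containing t.
    start = seg_idx * segment_duration
    for sub_idx, sub in enumerate(segments[seg_idx].subsegments):
        end = start + sub.duration
        if t < end:
            # Step 3: decoding starts at the first SAP of this subsegment.
            return seg_idx, sub_idx, sub.first_sap_time
        start = end
    raise ValueError("sidx does not cover the access time point")


if __name__ == "__main__":
    segments = [
        Segment([Subsegment(2.0, 0.0), Subsegment(2.0, 2.0)]),
        Segment([Subsegment(2.0, 4.0), Subsegment(2.0, 6.0)]),
    ]
    print(locate_access(segments, 4.0, 5.0))
```

The returned SAP time is the point at which decoding actually starts, and it may differ from the access time point that the user requested.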
When a coded video stream based on background modeling coding is transmitted using DASH, a GOP is quite long and only one SAP is defined for each GOP, namely the independently decoded frame in the GOP. Therefore, there may be a quite large difference between the access time point input by a user and the SAP at which access is actually performed. As a result, access accuracy is poor, and user access experience is affected.
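The size of this difference can be estimated with a short calculation. The function `access_gap` and the numeric values below are hypothetical and are chosen only for illustration. The sketch assumes a constant frame rate, GOPs of equal length, and that access is performed at the G frame that opens the GOP containing the requested time point.

```python
def access_gap(t: float, gop_frames: int, fps: float) -> float:
    """Return the gap, in seconds, between access time t and the SAP used.

    Assumes one SAP per GOP, located at the first frame of the GOP.
    """
    gop_seconds = gop_frames / fps
    # Start time of the GOP that contains t, which is the only SAP available.
    sap_time = (t // gop_seconds) * gop_seconds
    return t - sap_time


if __name__ == "__main__":
    # Hypothetical long GOP: 1500 frames at 25 frames per second (60 s).
    print(access_gap(95.0, 1500, 25.0))
    # Hypothetical short GOP: 50 frames at 25 frames per second (2 s).
    print(access_gap(95.0, 50, 25.0))
```

Under these assumptions the gap can approach the full playback duration of one GOP, so the longer the GOP, the larger the possible difference between the requested time point and the actual access point.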