Recently, digital set-top boxes (STBs) with local storage, called digital video recorders (DVRs), have begun to penetrate TV households. With this new consumer device, television viewers can record broadcast programs into the local storage of their DVR in a digital video compression format such as MPEG-2. A DVR allows television viewers to watch programs in the way they want and when they want. Due to the nature of digitally recorded video, viewers can now directly access a certain point of a recorded program, in addition to using the traditional video cassette recorder (VCR) controls such as fast forward and rewind. Furthermore, if segmentation metadata for a recorded program is available, viewers can browse the program by selecting predefined video segments within it, and can play highlights as well as a summary of the recorded program. The metadata of the recorded program can be delivered to the DVR by television broadcasters or third-party service providers. The delivered metadata can be stored in the local storage of the DVR for later use by viewers. The metadata can be described in proprietary formats or in international open standard specifications such as MPEG-7 or TV-Anytime.
To provide DVR users with advanced features such as browsing of recorded TV programs, a cost-effective method is needed for efficiently indexing TV broadcast programs, delivering metadata to the STB, and providing efficient random access to sub-parts of the recorded programs in the DVR.
Real-Time Indexing of TV Programs
Consider a scenario, called “quick metadata service” on live broadcasting, where descriptive metadata of a broadcast program is delivered to a DVR while the program is being recorded. In the case of live broadcasting of sports games such as football, television viewers might want to selectively view highlight events of a game, as well as plays of their favorite players, while watching the live game. Without metadata describing the program, it is not easy for viewers to locate the video segments corresponding to the highlight events or objects (players, in the case of sports games) by using conventional controls such as fast forwarding. The metadata includes time positions, such as the start time and duration, and a textual description for each video segment corresponding to a semantically meaningful highlight event or object. If the metadata is generated in real-time and incrementally delivered to viewers at a predefined interval, or whenever a new highlight event or object occurs, the metadata can then be stored in the local storage of the DVR for a more informative and interactive TV viewing experience, such as navigation of content by highlight events or objects. The metadata can also be delivered just once, immediately after its corresponding broadcast television program has finished.
One of the key components of the quick metadata service is real-time indexing of broadcast television programs. Various methods have been proposed for real-time video indexing.
U.S. Pat. No. 6,278,446 (“Liou”), the entire disclosure of which is incorporated by reference herein, discloses a system for interactively indexing and browsing video with easy-to-use interfaces. Specifically, Liou teaches that automatic indexing, in conjunction with human interaction for verification and correction, provides a meaningful video table of contents.
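The kind of segment metadata described above can be sketched as a small data structure. This is only an illustrative sketch: the class names `HighlightSegment` and `ProgramMetadata`, the field names, and the use of seconds as the time unit are assumptions made for the example, and are not taken from MPEG-7, TV-Anytime, or any particular metadata format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HighlightSegment:
    """One video segment for a highlight event or object (hypothetical layout)."""
    start_sec: float      # start time position of the segment
    duration_sec: float   # duration of the segment
    description: str      # textual description, e.g. event or player name


@dataclass
class ProgramMetadata:
    """Metadata for one recorded program, as it might be stored on a DVR."""
    program_title: str
    segments: List[HighlightSegment] = field(default_factory=list)

    def add_segment(self, segment: HighlightSegment) -> None:
        # Incremental delivery: a newly indexed highlight arrives and is
        # merged into the stored metadata, kept ordered by start time.
        self.segments.append(segment)
        self.segments.sort(key=lambda s: s.start_sec)

    def find(self, keyword: str) -> List[HighlightSegment]:
        # Navigation by highlight event or object: case-insensitive match
        # against the textual description.
        kw = keyword.lower()
        return [s for s in self.segments if kw in s.description.lower()]
```

With such a structure, a viewer's request to jump to a particular event reduces to a lookup in `segments` followed by a seek to the stored start time.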
U.S. Pat. No. 6,360,234 (“Jain”), the entire disclosure of which is incorporated by reference herein, discloses a video cataloger system and method for capturing and indexing video in real-time or non-real time, and publishing intelligent video via the World Wide Web. In parallel to the indexing process, the system of Jain allows users to navigate through the video by using the index to go directly to the exact point of interest, rather than streaming it from start to finish.
The conventional methods can generate low-level metadata in real-time by decoding closed-caption texts, detecting and clustering shots, selecting key frames, and recognizing faces or speech, all of which are performed automatically and synchronized with the video. However, with the current state-of-the-art technologies for image understanding and speech recognition, it is very difficult to accurately detect highlights and generate a semantically meaningful and practically usable highlight summary of events or objects in real-time. That is, the conventional methods do not provide semantically meaningful and practically usable metadata in real-time, or even in non-real time, for the following reasons:
First, as described earlier, it is hard to automatically recognize diverse semantically meaningful highlights. For example, the keyword “touchdown” can be identified from decoded closed-caption texts in order to automatically find touchdown highlights, but this approach results in many false alarms. Therefore, generating semantically meaningful and practically usable highlights still requires the intervention of a human operator.
Second, the conventional methods do not provide an efficient way of manually marking distinguished highlights in real-time. Consider a case in which a series of highlights occurs at short intervals. Since it takes time for a human operator to type in a title and an extra textual description of a new highlight, the operator may miss the immediately following events.
Media localization within a given temporal video stream can be described using either byte location information or media time information that specifies a time point contained in the media data. In other words, in order to describe the location of a specific video frame within a video stream, a byte offset, i.e., the number of bytes to be skipped from the beginning of the video stream, can be used. Alternatively, a media time describing a relative time point from the beginning of the video stream can be used.
In U.S. Pat. No. 6,360,234 (“Jain”), the relative time from the beginning of the encoded video stream file is used to access a certain position of an encoded video stream. In the case of VOD (Video On Demand) through the interactive Internet or a high-speed network, the start and end positions of each video program can be defined unambiguously in terms of media time as zero and the length of the video program, respectively, since each program is stored in the form of a separate media file in the storage at the head end and, further, each video program is delivered through streaming on each client's demand. Thus, a user at the client side can gain access to the appropriate temporal positions or video frames within the selected video stream as described in the metadata. However, in the case of TV broadcasting, since a digital stream or analog signal is continuously broadcast, the start and end positions of each broadcast program are not clearly defined. Since a media time or byte offset is usually defined with reference to the start of a media file, it can be ambiguous to describe a specific temporal location of a broadcast program using media times or byte offsets in order to relate an interactive application or event to, or to access, a specific location within a video program.
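The ambiguity of media time for broadcast recordings can be illustrated with a small sketch. It assumes two hypothetical DVRs that record the same continuous broadcast but start recording 90 seconds apart; the function name and the clock values are invented for illustration only.

```python
def media_time_to_wall_clock(recording_start_sec: float, media_time_sec: float) -> float:
    """Map a media time (relative to the start of a recorded file) back to
    the wall-clock second of the day at which that frame was broadcast."""
    return recording_start_sec + media_time_sec


# Hypothetical recordings of the same broadcast, started 90 s apart.
dvr_a_start = 19 * 3600        # recording A starts at 19:00:00
dvr_b_start = 19 * 3600 + 90   # recording B starts at 19:01:30

# The same media time of 600 s is used to address both recorded files.
media_time = 600.0
point_a = media_time_to_wall_clock(dvr_a_start, media_time)
point_b = media_time_to_wall_clock(dvr_b_start, media_time)

# The two recordings resolve the same media time to different broadcast moments.
print(point_b - point_a)  # 90.0
```

A metadata entry that says "600 seconds from the start" therefore points to different content on each DVR, which is the difficulty that does not arise for a VOD file with a well-defined start.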
U.S. Pat. No. 6,357,042 (“Anand”), the entire disclosure of which is incorporated by reference herein, discloses an authoring system for interactive video that has two or more authoring stations for providing authored metadata to be related to a main video data stream, and a multiplexer for relating authored metadata from the authoring sources to the main video data stream. Specifically, Anand uses the PTS (Presentation Time Stamp) of video frames when the authoring stations annotate metadata created from the main video, and the multiplexer relates the metadata to the main video stream. Thus, Anand uses a PTS value for random access to a specific position of a media stream.
The PTS is a field that may be present in the header of a PES (Packetized Elementary Stream, as defined in MPEG-2) packet and that indicates the time at which a presentation unit is presented in the system target decoder. However, the use of PTS values is not appropriate, especially for digitally broadcast media streams, because it requires parsing of the PES layer and is thus computationally more expensive. Further, scrambled broadcast media streams must be descrambled in order to access the PES packets that contain the PTSs. The MPEG-2 Systems specification describes a scrambling mode of the transport stream (TS) packet payload containing PES data in which the payload shall be scrambled but the TS packet header and the adaptation field, when present, shall not be scrambled. Thus, if a broadcast media stream is scrambled, descrambling is needed to access the PTS located in the TS payload.
The Multimedia Home Platform (MHP) defines a generic interface between interactive digital applications and the terminals on which those applications execute. According to http://www.mhp-interactive.org/tutorial/synchronization.html, the association of an application with a specific TV show requires synchronization of the behavior of the application to the action on screen. Since there is no real concept of media time for a broadcast MPEG-2 stream, MHP uses DSM-CC Normal Play Time (NPT), which is a time code embedded in a special descriptor in an MPEG-2 private section and which provides a known time reference for a piece of media. Although NPT values, if present, typically increase throughout a single piece of media, they may have discontinuities either forwards or backwards. Thus, even if a stream containing NPT is edited (either to be made shorter or to have advertisements inserted), the NPT values will not need updating and will remain the same for that piece of media. However, one issue with the use of NPT values is whether they are actually being broadcast.
“A practical implementation of TV-Anytime on DVB (Digital Video Broadcasting) and the Internet” at www.bbc.co.uk/rd/pubs/whp/whp-pdf-files/WHP020.pdf describes a segmentation scenario allowing a service provider to refer to different sub-parts of programs. The segmentation allows segments in TV-Anytime metadata to reference sub-parts of the program by time on an unambiguous, continuous time line defined for the program. Thus, it was proposed that MPEG-2 DSM-CC NPT (Normal Play Time) should be used for these time lines. This requires that both head ends and receiving terminals can handle NPT accurately.
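The layering described above can be sketched in code. The sketch below reads the unscrambled 4-byte TS packet header, checks the two-bit transport_scrambling_control field, and only then parses into the PES header to decode the 33-bit PTS. The function names are hypothetical, and the sketch makes simplifying assumptions: the packet is a single 188-byte TS packet carrying video or audio PES data, and the whole PES header fits in that one packet.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47


def parse_ts_header(packet: bytes) -> dict:
    """Parse the TS packet header, which is never scrambled."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid 188-byte TS packet")
    return {
        "payload_unit_start": bool(packet[1] & 0x40),
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],
        "scrambling_control": (packet[3] >> 6) & 0x03,
        "adaptation_field_control": (packet[3] >> 4) & 0x03,
    }


def extract_pts(packet: bytes):
    """Return the 33-bit PTS carried in this packet, or None if unavailable."""
    hdr = parse_ts_header(packet)
    if hdr["scrambling_control"] != 0:
        # Payload is scrambled: the PES header, and hence the PTS,
        # cannot be read without descrambling first.
        return None
    if not hdr["payload_unit_start"]:
        return None  # no PES packet starts in this TS packet
    afc = hdr["adaptation_field_control"]
    if afc in (0, 2):
        return None  # reserved value, or adaptation field only (no payload)
    offset = 4
    if afc == 3:
        offset += 1 + packet[4]  # skip adaptation field (length byte + body)
    pes = packet[offset:]
    if len(pes) < 14 or pes[0:3] != b"\x00\x00\x01":
        return None  # no PES start code prefix
    if not (pes[7] & 0x80):
        return None  # PTS_DTS_flags indicate that no PTS is present
    b = pes[9:14]
    # The PTS is split into 3 + 15 + 15 bits separated by marker bits.
    return ((((b[0] >> 1) & 0x07) << 30) | (b[1] << 22)
            | ((b[2] >> 1) << 15) | (b[3] << 7) | (b[4] >> 1))
```

Even in the unscrambled case, the PTS is reached only after locating the PES start inside the TS payload and decoding several header fields, which is the extra parsing cost noted above.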
U.S. Patent Application Publication No. US 2001/0014210 A1 (“Kang”), the entire disclosure of which is incorporated by reference herein, discloses a personal TV with improved functions. Specifically, Kang, by using synchronized encoding and indexing, allows users to intelligently navigate through the video by using the index to go directly to the exact point of interest, rather than streaming it from start to finish. Kang suggests the use of byte offset values of groups of pictures (GOPs; a GOP serves as a basic access unit, with an I-picture serving as an entry point to facilitate random access) for media localization. However, to generate an offset table that contains media times and the byte offset values of the corresponding GOPs, it would be computationally expensive to parse into the video PES in order to compute the GOP offset values. Further, descrambling is needed when a recorded media stream is scrambled. Alternatively, Kang specifies that GOP offset values can be transmitted. Kang's system generates an index file by capturing and analyzing the stream before the stream is input to the MPEG-2 stream transmitter in a broadcast system. Kang's system must therefore be installed at a location that is tightly connected to the broadcast system. Thus, Kang's scheme could be costly, and, further, allowing third parties to freely access the stream inside a broadcast system is a sensitive issue.
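An offset table of the kind discussed above, mapping media times to the byte offsets of GOP starts, can be sketched as follows. This is a minimal illustration and not Kang's implementation: the class name `GopOffsetTable`, the table layout, and the sample values are assumptions made for the example.

```python
import bisect


class GopOffsetTable:
    """Maps a media time to the byte offset of the GOP that contains it."""

    def __init__(self, entries):
        # entries: iterable of (media_time_sec, byte_offset) pairs, one per GOP
        self._entries = sorted(entries)
        self._times = [t for t, _ in self._entries]

    def seek(self, media_time_sec: float) -> int:
        """Return the byte offset of the last GOP starting at or before
        media_time_sec, i.e. the nearest preceding random access point."""
        i = bisect.bisect_right(self._times, media_time_sec) - 1
        if i < 0:
            raise ValueError("requested time precedes the first indexed GOP")
        return self._entries[i][1]


# Hypothetical table: one GOP every 0.5 s, 188,000 bytes per GOP.
table = GopOffsetTable([(0.0, 0), (0.5, 188000), (1.0, 376000)])
print(table.seek(0.7))  # 188000
```

The lookup itself is cheap; the cost described above lies in building the table, since finding each GOP start requires parsing into the video PES and, for scrambled streams, descrambling it first.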
U.S. Pat. No. 5,986,692 (“Logan '692”), the entire disclosure of which is incorporated by reference herein, discloses a scheme for computer-enhanced broadcast monitoring. A time stamp signal is generated at time-spaced intervals to be used as a time-based index for the broadcast signal.
U.S. Application 2002/0120925A1 (“Logan '925”), the entire disclosure of which is incorporated by reference herein, discloses a system for utilizing metadata created either at a central station or at each user's location. Logan '925 focuses on the automatic generation of metadata. In the case of DVRs for analog broadcasting, such as those from TiVo and ReplayTV, the analog broadcast signal is digitized and encoded in MPEG-2, and the encoded stream is then stored in the STB storage. A broadcast analog TV signal such as NTSC (National Television System Committee) does not contain time information such as PTS or broadcasting time. Thus, for analog broadcasting, it is not obvious how to devise a method for efficiently indexing analog TV broadcast programs based on an appropriate time line, delivering metadata to DVRs, and randomly accessing, in the DVRs, the specific positions of media streams described in the metadata. In the case of DVRs for digital broadcasting, it is still difficult to devise an efficient time-based index for video stream localization that can be used both in the indexer and in DVR clients.
As such, there still remains a need for a system and method that provide cost-effective and efficient indexing, delivery of metadata, and access to recorded media streams in DVRs, for digital TV broadcast programs as well as analog TV broadcast programs.