Multimedia streams are linear by nature, but the content within them is usually organized into chapters, with a chapter boundary falling where the content transitions from one subject to another. However, whereas the chapters in books are clearly specified, chapters in most video streams are not defined, especially for live programming. This is because video streams have largely been consumed linearly, so chapters have not been essential to their consumption. However, with the advent of interactive modes of video consumption, starting with DVDs and continuing with personal video recorders and IP-delivered videos, chapters are becoming an important part of navigating and discovering video content.
There are multiple approaches to automatically finding chapter boundaries, including video analysis for black frames, audio analysis for speakers and audio transitions, textual analysis of the stream's transcripts, and combinations thereof. However, these methods are often designed specifically to analyze certain types of programming, such as newscasts or movies, and are ill-suited for analyzing the many other genres of programming. That is, while an existing prior art method may be effective at detecting chapters in newscasts, its accuracy would degrade quickly for non-newscasts such as dramas or reality shows. Due to this limitation, existing prior art methods are unable to accurately detect chapters across all types of programming, and can therefore provide interactive video consumption for only a small subset of video streams, which limits their usefulness.
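As an illustration of the simplest of these approaches, the following is a minimal sketch of black-frame detection. It is not taken from any of the cited documents. It assumes that each frame has already been decoded into a flat sequence of 8-bit luminance values, and the function name `find_black_frames` and the default `luma_threshold` of 16 are hypothetical choices made for this example.

```python
from typing import List, Sequence


def find_black_frames(
    frames: Sequence[Sequence[int]],
    luma_threshold: float = 16.0,  # assumed cutoff for "black"; not from the source
) -> List[int]:
    """Return the indices of frames whose mean luminance is below the threshold."""
    candidates = []
    for index, pixels in enumerate(frames):
        if not pixels:
            continue  # skip empty frames rather than divide by zero
        mean_luma = sum(pixels) / len(pixels)
        if mean_luma < luma_threshold:
            candidates.append(index)
    return candidates


# Three tiny 4-pixel "frames": bright, nearly black, bright.
frames = [
    [120, 130, 125, 118],
    [3, 0, 2, 1],
    [90, 95, 100, 85],
]
print(find_black_frames(frames))
```

A detector of this kind flags any dark frame, including a fade or a night scene within a chapter, so by itself it cannot distinguish chapter boundaries from other dark passages.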
Therefore, a need exists for a method of automatically detecting chapter boundaries within multimedia streams that is robust across all types of programming, and that is automated and efficient enough to perform this detection on live video streams as they are being broadcast.
Various prior art arrangements are discussed in the following U.S. prior art documents.
U.S. Pat. No. 6,961,954—uses multiple types of analysis to find potential chapter boundaries, and uses finite state automata (FSA) to determine actual chapters. The assumption is that each show follows a traversal through the states of an automaton, which is manually constructed and is therefore either brittle or has to be continuously updated by hand to account for changes in chapter structures. Additionally, this prior art does not address how to expand beyond newscasts, since a new FSA would be needed for each type of programming, and it is not obvious how to select the “correct” FSA for a given show when there are multiple ones to choose from.
U.S. Pat. No. 7,181,757—proposes a system for describing summaries of chapters to support their retrieval and presentation. However, this prior art does not specify how these summaries are determined, other than through a module of rules for selecting summaries. These rules are assumed to be manually edited for specific types of videos, and are therefore labor intensive and brittle.
U.S. Pat. No. 7,184,959—uses speaker identification to find chapter boundaries, plus additional analysis of video and text for chapter description and searches. The assumption is that chapters begin with anchors introducing them, so the method is best suited for newscasts. It also requires a database of audio and visual samples of known anchors, and would therefore require ongoing updates of the database to add new persons for the system to recognize.
U.S. Pat. No. 7,486,542—describes the retrieval and personalization of news clips via keyword queries. This prior art does not address how the chapters are determined, but instead focuses on presenting the detected chapters of newscasts to the users.
U.S. Pat. No. 7,646,960—describes a chapter detection method based on the rate of change of “cells”, which are effectively frames within videos. The assumption is that chapter boundaries occur where the visual differences between frames change rapidly. This assumption is not robust, since there are many non-transitions in which frames change rapidly and many true transitions in which they do not. This method is also computationally expensive, since it has to track many cells and how they change throughout the video stream.
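The weakness of relying on rapid visual change can be seen in a minimal sketch of generic frame differencing. This is not the method of the cited patent, whose cell bookkeeping is not reproduced here. Frames are assumed to be equal-length flat sequences of luminance values, and the names `mean_abs_diff` and `find_rapid_changes` and the default `diff_threshold` of 50 are hypothetical.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean absolute per-pixel difference between two equal-length frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def find_rapid_changes(
    frames: Sequence[Sequence[int]],
    diff_threshold: float = 50.0,  # assumed cutoff; not from the source
) -> List[int]:
    """Return indices of frames that differ sharply from the preceding frame."""
    return [
        i
        for i in range(1, len(frames))
        if mean_abs_diff(frames[i - 1], frames[i]) > diff_threshold
    ]


# Two dark frames followed by two bright frames: one abrupt change at index 2.
frames = [
    [10, 10, 10, 10],
    [12, 11, 10, 9],
    [200, 210, 190, 205],
    [198, 209, 191, 204],
]
print(find_rapid_changes(frames))
```

The sketch reports the abrupt change at index 2 whether it is a chapter boundary or an ordinary camera cut within a scene, and it reports nothing for a boundary where the picture changes gradually.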
U.S. Pat. No. 7,877,774—describes detecting boundaries between newscasts and commercials via audio analysis, using automatic speaker analysis to find anchorpersons. The assumption is that chapters always begin with an anchorperson making the introduction, which limits its applicability to programs other than newscasts.
U.S. Pat. No. 8,189,114—describes chapter boundary detection based on analysis of visual differences between frames. The assumption is that chapter boundaries have transition effects and visual dissimilarities, which would result in too many false positives since most such transitions are not chapter boundaries. This prior art compensates by adding other methods of analysis to find correlations, which greatly increases complexity and computational costs.
U.S. Pat. No. 8,230,343—describes collecting metadata about segment boundaries, and collecting human inputs to correct errors and refine segment boundaries. This prior art requires recruitment and participation of humans in editing the metadata, and is not suitable for live video streams.
U.S. Pat. No. 8,392,183—describes summarization of videos based on grouping similar textual sections into chapters and subsequently condensing them. The assumption is that the subject of the transcript changes significantly from one chapter to the next, which is not necessarily the case for most video programming, especially for fictional works like sitcoms and movies. Conversely, there are programs whose subjects do change within the same chapter, such as game shows and interviews, and for these this prior art would create more chapters than desired.
U.S. Pat. No. 8,422,859—describes commercial detection based on audio transitions. The assumption is that there is usually a change in audio characteristics between programming and commercials, which does not hold reliably across all types of programming and all types of commercials.
U.S. Pat. No. 8,479,238—describes generating metadata of videos based on textual analysis of transcripts, and enables users to query clips containing certain keywords. This prior art focuses on the analysis and querying of segments after their identification, but does not specify how the boundaries are automatically determined. Therefore, this prior art is predicated on chapter detection having taken place first.
U.S. Pat. Nos. 8,630,536 & 8,995,820—describe probabilistic commercial detection via batch processing, which is not well suited to live broadcasts.